In this article, we’ll explore how to merge two DataFrames based on multiple columns, ensuring seamless integration of datasets. Merging DataFrames on multiple columns using Pandas is a foundational skill in data analysis.
Sample data for hands-on learning
import pandas as pd
# First DataFrame
data1 = {
'ID': [1, 2, 3, 4, 5],
'Name': ['Sachin', 'Ram', 'Abhilash', 'Mike', 'Elaine'],
'Age': [30, 25, 28, 40, 35]
}
df1 = pd.DataFrame(data1)
# Second DataFrame
data2 = {
'ID': [3, 4, 5, 6, 7],
'Name': ['Abhilash', 'Mike', 'Elaine', 'Jasmine', 'Robert'],
'Score': [85, 90, 78, 88, 79]
}
df2 = pd.DataFrame(data2)
print(df1)
print(df2)
Output
ID Name Age
0 1 Sachin 30
1 2 Ram 25
2 3 Abhilash 28
3 4 Mike 40
4 5 Elaine 35
ID Name Score
0 3 Abhilash 85
1 4 Mike 90
2 5 Elaine 78
3 6 Jasmine 88
4 7 Robert 79
Merging DataFrames on multiple columns
The primary method to merge DataFrames is the merge function. When merging on multiple columns, provide the column names as a list.
merged_df = pd.merge(df1, df2, on=['ID', 'Name'], how='inner')
print(merged_df)
Output
ID Name Age Score
0 3 Abhilash 28 85
1 4 Mike 40 90
2 5 Elaine 35 78
In the above code:
on
specifies the columns on which the merge operation is based.
how
defines the type of merge. The ‘inner’ merge retains only rows with matching values in both DataFrames.
Understanding different types of merges
While ‘inner’ merge is common, there are other types you should be aware of:
Inner: Only rows with matching values in both DataFrames are retained.
Left: All rows from the left DataFrame and matching rows from the right DataFrame are retained. NaN for non-matching rows.
Right: All rows from the right DataFrame and matching rows from the left DataFrame are retained. NaN for non-matching rows.
Outer: All rows from both DataFrames are retained, with NaN in places where no match is found.
# Example of a left merge
left_merged_df = pd.merge(df1, df2, on=['ID', 'Name'], how='left')
print(left_merged_df)
Output
ID Name Age Score
0 1 Sachin 30 NaN
1 2 Ram 25 NaN
2 3 Abhilash 28 85.0
3 4 Mike 40 90.0
4 5 Elaine 35 78.0
Potential pitfalls and solutions
Suffixes: When columns, other than the ones you’re merging on, have the same name, Pandas appends suffixes. Use the suffixes
parameter to customize these.
custom_merged_df = pd.merge(df1, df2, on=['ID', 'Name'], how='inner', suffixes=('_Left', '_Right'))
print(custom_merged_df)
Output
ID Name Age Score
0 3 Abhilash 28 85
1 4 Mike 40 90
2 5 Elaine 35 78
Mismatched data types: Ensure columns you’re merging on have consistent data types across both DataFrames.
Missing values: Rows with NaN in the merge columns will not be matched. Ensure data consistency before merging.
Refer more on python here : Python
Refer more on python here : PySpark