Merging dataFrames on multiple columns with Python Pandas

Python Pandas @ Freshers.in

In this article, we’ll explore how to merge two DataFrames based on multiple columns, ensuring seamless integration of datasets. Merging DataFrames on multiple columns using Pandas is a foundational skill in data analysis.

Sample data for hands-on learning

import pandas as pd
# First DataFrame
data1 = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Sachin', 'Ram', 'Abhilash', 'Mike', 'Elaine'],
    'Age': [30, 25, 28, 40, 35]
}
df1 = pd.DataFrame(data1)

# Second DataFrame
data2 = {
    'ID': [3, 4, 5, 6, 7],
    'Name': ['Abhilash', 'Mike', 'Elaine', 'Jasmine', 'Robert'],
    'Score': [85, 90, 78, 88, 79]
}
df2 = pd.DataFrame(data2)

print(df1)
print(df2)

Output

   ID      Name  Age
0   1    Sachin   30
1   2       Ram   25
2   3  Abhilash   28
3   4      Mike   40
4   5    Elaine   35

   ID      Name  Score
0   3  Abhilash     85
1   4      Mike     90
2   5    Elaine     78
3   6   Jasmine     88
4   7    Robert     79

Merging DataFrames on multiple columns

The primary method to merge DataFrames is the merge function. When merging on multiple columns, provide the column names as a list.

merged_df = pd.merge(df1, df2, on=['ID', 'Name'], how='inner')
print(merged_df)

Output

   ID      Name  Age  Score
0   3  Abhilash   28     85
1   4      Mike   40     90
2   5    Elaine   35     78

In the above code:

on specifies the columns on which the merge operation is based.

how defines the type of merge. The ‘inner’ merge retains only rows with matching values in both DataFrames.

Understanding different types of merges

While ‘inner’ merge is common, there are other types you should be aware of:

Inner: Only rows with matching values in both DataFrames are retained.

Left: All rows from the left DataFrame and matching rows from the right DataFrame are retained. NaN for non-matching rows.

Right: All rows from the right DataFrame and matching rows from the left DataFrame are retained. NaN for non-matching rows.

Outer: All rows from both DataFrames are retained, with NaN in places where no match is found.

# Example of a left merge
left_merged_df = pd.merge(df1, df2, on=['ID', 'Name'], how='left')
print(left_merged_df)

Output

   ID      Name  Age  Score
0   1    Sachin   30    NaN
1   2       Ram   25    NaN
2   3  Abhilash   28   85.0
3   4      Mike   40   90.0
4   5    Elaine   35   78.0

Potential pitfalls and solutions

Suffixes: When columns, other than the ones you’re merging on, have the same name, Pandas appends suffixes. Use the suffixes parameter to customize these.

custom_merged_df = pd.merge(df1, df2, on=['ID', 'Name'], how='inner', suffixes=('_Left', '_Right'))
print(custom_merged_df)

Output

   ID      Name  Age  Score
0   3  Abhilash   28     85
1   4      Mike   40     90
2   5    Elaine   35     78

Mismatched data types: Ensure columns you’re merging on have consistent data types across both DataFrames.

Missing values: Rows with NaN in the merge columns will not be matched. Ensure data consistency before merging.

Refer more on python here :
Refer more on python here : PySpark 

Author: user