Join operations are fundamental to data manipulation and analysis. Among these, the outer join stands out for its ability to merge data based on shared columns, while retaining unmatched values from both DataFrames. In this guide, we’ll dive deep into the concept of outer joins in Pandas using comprehensive examples.
To understand the outer join operation, let’s consider two sample DataFrames. One DataFrame contains the scores of students in Mathematics, and the other contains scores in English.
import pandas as pd
# Sample data for students' scores in Mathematics
math_scores = pd.DataFrame({
'Student_ID': [101, 102, 103, 104],
'Math_Score': [85, 92, 78, 88]
})
# Sample data for students' scores in English
english_scores = pd.DataFrame({
'Student_ID': [102, 103, 104, 105],
'English_Score': [87, 81, 92, 78]
})
print(math_scores)
print(english_scores)
Performing an Outer Join:
An outer join merges both DataFrames, keeping all records from both frames and placing NaNs where data is absent in one of them.
merged_data = pd.merge(math_scores, english_scores, on='Student_ID', how='outer')
print(merged_data)
The output represents a combination of both DataFrames. Students with ID 101 and 105 have NaN values because they are missing in one of the original frames.
Output
Student_ID Math_Score English_Score
0 101 85.0 NaN
1 102 92.0 87.0
2 103 78.0 81.0
3 104 88.0 92.0
4 105 NaN 78.0
The outer join operation provides a comprehensive view of students’ scores in both subjects. Notice that:
Student 101 has a score in Mathematics but not in English, resulting in a NaN for English_Score.
Conversely, Student 105 has an English score but not a Mathematics score, resulting in a NaN for Math_Score.
All other students have scores recorded in both subjects.