One of the most commonly used operations in PySpark is joining two dataframes. A full outer join is one of the join types PySpark supports, alongside inner, left outer, and right outer joins, among others. In this article, we will explain what a full outer join is and how to perform it in PySpark with a detailed example.
What is a Full Outer Join?
A full outer join returns all the rows from both the left and right dataframes. Where a row in one dataframe has no match in the other, the columns that come from the other dataframe are filled with null values.
To illustrate this concept, consider the following two dataframes:
df1:
+---+-------------+
| id|         name|
+---+-------------+
|  1|     John Bob|
|  2|    Alice Jim|
|  3|Charlie Barry|
+---+-------------+
df2:
+---+-------------------+
| id|              email|
+---+-------------------+
|  1| john.bob@gmail.com|
|  4|alice.jim@yahoo.com|
+---+-------------------+
A full outer join between df1 and df2 on the id column would result in the following dataframe:
+---+-------------+-------------------+
| id|         name|              email|
+---+-------------+-------------------+
|  1|     John Bob| john.bob@gmail.com|
|  2|    Alice Jim|               null|
|  3|Charlie Barry|               null|
|  4|         null|alice.jim@yahoo.com|
+---+-------------+-------------------+
As you can see, the resulting dataframe contains all the rows from both df1 and df2, and fills in null values where there are no matches.
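If you prefer SQL, the same result can be produced with Spark SQL. Below is a minimal sketch assuming df1 and df2 are created as in the PySpark example later in this article; the view names people_names and people_emails are illustrative:
# register the dataframes as temporary views (view names are illustrative)
df1.createOrReplaceTempView("people_names")
df2.createOrReplaceTempView("people_emails")
# COALESCE merges the two id columns into one, matching the join() output
spark.sql("""
    SELECT COALESCE(n.id, e.id) AS id, n.name, e.email
    FROM people_names n
    FULL OUTER JOIN people_emails e ON n.id = e.id
""").show()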
How to perform a Full Outer Join in PySpark
To perform a full outer join in PySpark, we call the join() method on one of the dataframes and pass the other dataframe, the join condition, and the join type. Here is the general syntax:
joined_df = df1.join(df2, on="join_column", how="full_outer")
where df1 and df2 are the dataframes to be joined, "join_column" is the column to join on, and "full_outer" is the join type.
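The how argument also accepts the equivalent aliases "outer", "full", and "fullouter". If you need to join on columns with different names, or want to keep the id column from both sides, you can pass an explicit condition instead of a column name. A minimal sketch, assuming the df1 and df2 from the example below:
# passing an explicit condition keeps the id column from both dataframes
joined_explicit = df1.join(df2, df1["id"] == df2["id"], "full_outer")
# "outer" behaves identically to "full_outer"
joined_alias = df1.join(df2, on="id", how="outer")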
Let’s use the same dataframes from the previous example to perform a full outer join in PySpark:
from pyspark.sql import SparkSession

# create a SparkSession
spark = SparkSession.builder.appName("Full Outer Join Example at Freshers.in").getOrCreate()

# create df1
df1 = spark.createDataFrame([(1, "John Bob"), (2, "Alice Jim"), (3, "Charlie Barry")], ["id", "name"])
df1.show()

# create df2
df2 = spark.createDataFrame([(1, "john.bob@gmail.com"), (4, "alice.jim@yahoo.com")], ["id", "email"])

# perform a full outer join on the 'id' column
joined_df = df1.join(df2, on="id", how="full_outer")

# display the resulting dataframe
joined_df.show()
Result of df1
+---+-------------+
| id|         name|
+---+-------------+
|  1|     John Bob|
|  2|    Alice Jim|
|  3|Charlie Barry|
+---+-------------+
Result of joined_df
+---+-------------+-------------------+
| id|         name|              email|
+---+-------------+-------------------+
|  1|     John Bob| john.bob@gmail.com|
|  3|Charlie Barry|               null|
|  2|    Alice Jim|               null|
|  4|         null|alice.jim@yahoo.com|
+---+-------------+-------------------+
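Note that Spark does not guarantee row order in the output of a join, which is why joined_df prints its rows in a different order than the conceptual table shown earlier. If you need a stable order, sort explicitly; you can also replace the nulls introduced by the join. A minimal sketch (the placeholder values passed to fillna are illustrative):
# row order after a join is not deterministic; sort explicitly if needed
joined_df.orderBy("id").show()
# replace the nulls produced by non-matching rows with placeholder text
joined_df.fillna({"name": "unknown", "email": "not provided"}).show()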