Apache Spark is a widely used engine for distributed data processing. Pandas API on Spark lets you apply familiar pandas syntax to data manipulation and SQL-style operations on Spark. This article looks at the ‘merge’ function, which combines DataFrames using database-style joins.
Introduction to the ‘merge’ Function
The ‘merge’ function in Pandas API on Spark merges two DataFrame objects with a database-style join. It combines datasets on common columns or indices, much like a SQL JOIN.
Syntax:
merge(obj, right, how='inner', on=None, left_on=None, right_on=None, ...)
- obj: The left DataFrame.
- right: The right DataFrame to merge with obj.
- how: Type of merge to perform (‘inner’, ‘outer’, ‘left’, ‘right’); the default is ‘inner’.
- on: Column name(s) to join on; they must be present in both DataFrames.
- left_on: Column names from the left DataFrame to join on.
- right_on: Column names from the right DataFrame to join on.
Example: Performing a Database-Style Join
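Consider two DataFrames: one lists employees with their department, the other maps each department to its full name. The example below builds them as native Spark DataFrames and combines them with Spark's join(), which performs the same database-style join that ‘merge’ provides in Pandas API on Spark: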
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("Pandas API on Spark Example") \
    .getOrCreate()
# Sample data
employee_data = [('Sachin', 'Engineering'),
                 ('Raju', 'Sales'),
                 ('Boby', 'Engineering')]
department_data = [('Engineering', 'Engineering Department'),
                   ('Sales', 'Sales Department')]
# Create Spark DataFrames
employee_df = spark.createDataFrame(employee_data, ['Name', 'Department'])
department_df = spark.createDataFrame(department_data, ['Department', 'Department_Name'])
print("Employee DataFrame:")
employee_df.show()
print("Department DataFrame:")
department_df.show()
# Inner join on the shared 'Department' column using Spark's DataFrame.join()
merged_df = employee_df.join(department_df, on='Department', how='inner')
print("\nMerged DataFrame:")
merged_df.show()
Output:
Employee DataFrame:
+------+-----------+
|  Name| Department|
+------+-----------+
|Sachin|Engineering|
|  Raju|      Sales|
|  Boby|Engineering|
+------+-----------+
Department DataFrame:
+-----------+--------------------+
| Department|     Department_Name|
+-----------+--------------------+
|Engineering|Engineering Depar...|
|      Sales|    Sales Department|
+-----------+--------------------+

Merged DataFrame:
+-----------+------+--------------------+
| Department|  Name|     Department_Name|
+-----------+------+--------------------+
|Engineering|Sachin|Engineering Depar...|
|Engineering|  Boby|Engineering Depar...|
|      Sales|  Raju|    Sales Department|
+-----------+------+--------------------+