Pandas API on Spark: Merging DataFrame Objects with a Database-Style Join Using merge


Apache Spark is a distributed engine for large-scale data processing. The Pandas API on Spark exposes pandas-style syntax on top of it, which can simplify data manipulation for users who already know pandas. This article covers the ‘merge’ function, which combines DataFrames in the manner of a database-style join.

Introduction to the ‘merge’ Function

The ‘merge’ function in Pandas API on Spark merges DataFrame objects with a database-style join operation. It combines two datasets on common columns or on their indices, much as a SQL join combines two tables on a key.

Syntax:

merge(obj, right, how='inner', on=None, left_on=None, right_on=None, ...)
  • obj: The left DataFrame.
  • right: The right DataFrame to merge with.
  • how: Type of merge to be performed (‘inner’, ‘outer’, ‘left’, ‘right’). Defaults to ‘inner’.
  • on: Column names to join on. These must be present in both DataFrames.
  • left_on: Column names from the left DataFrame to join on.
  • right_on: Column names from the right DataFrame to join on.

Example: Performing a Database-Style Join

Consider two DataFrames representing employee information and department assignments:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Pandas API on Spark Example") \
    .getOrCreate()

# Sample data
employee_data = [('Sachin', 'Engineering'),
                 ('Raju', 'Sales'),
                 ('Boby', 'Engineering')]

department_data = [('Engineering', 'Engineering Department'),
                   ('Sales', 'Sales Department')]

# Create Spark DataFrames
employee_df = spark.createDataFrame(employee_data, ['Name', 'Department'])
department_df = spark.createDataFrame(department_data, ['Department', 'Department_Name'])

print("Employee DataFrame:")
employee_df.show()

print("Department DataFrame:")
department_df.show()

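# Inner join on the shared 'Department' column.
# Note: this uses the native Spark DataFrame.join, not the pandas-on-Spark merge function.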
merged_df = employee_df.join(department_df, on='Department', how='inner')

print("\nMerged DataFrame:")
merged_df.show()

Output:

Employee DataFrame:
+------+-----------+
|  Name| Department|
+------+-----------+
|Sachin|Engineering|
|  Raju|      Sales|
|  Boby|Engineering|
+------+-----------+

Department DataFrame:
+-----------+--------------------+
| Department|     Department_Name|
+-----------+--------------------+
|Engineering|Engineering Depar...|
|      Sales|    Sales Department|
+-----------+--------------------+


Merged DataFrame:
+-----------+------+--------------------+
| Department|  Name|     Department_Name|
+-----------+------+--------------------+
|Engineering|Sachin|Engineering Depar...|
|Engineering|  Boby|Engineering Depar...|
|      Sales|  Raju|    Sales Department|
+-----------+------+--------------------+

The ‘merge’ function in Pandas API on Spark lets users express database-style joins in familiar pandas syntax while the computation runs on Spark.
Author: user