Leveraging Pandas API on Spark to Read Excel Files: read_excel


The Pandas API on Spark combines the familiar pandas interface with Spark’s distributed execution, and it lets users read Excel files into Pandas-on-Spark DataFrames or Series. In this article, we’ll dive into the read_excel function’s usage, complete with an example and its output.

Understanding read_excel

The read_excel function in the Pandas API on Spark (pyspark.pandas.read_excel) reads Excel files into Pandas-on-Spark DataFrames or Series. Once loaded, the tabular data can be processed with Spark’s distributed computing capabilities while retaining the familiar interface of Pandas. Let’s explore its usage with an example.

Example Usage

Suppose we have an Excel file named data.xlsx containing some sample data in a sheet named Sheet1. We can read this Excel file into a Pandas-on-Spark DataFrame using read_excel.
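
To reproduce the example, the file has to exist first. The following is a minimal sketch for creating it, assuming plain pandas and an Excel engine such as openpyxl are installed. The helper names build_sample and write_sample_excel are hypothetical, and the rows are the ones shown in the Output section of this article.

```python
import pandas as pd


def build_sample():
    """Return the four sample rows from this article's output table."""
    return pd.DataFrame(
        {
            "Name": ["Sachin", "Ram", "Sreerag", "Dravid"],
            "Age": [30, 35, 40, 45],
            "Gender": ["Female", "Male", "Male", "Male"],
        }
    )


def write_sample_excel(path="data.xlsx", sheet="Sheet1"):
    """Write the sample rows to one Excel sheet.

    Assumes an Excel writer engine (for example openpyxl) is available.
    index=False keeps the pandas row index out of the sheet.
    """
    build_sample().to_excel(path, sheet_name=sheet, index=False)
```

Calling write_sample_excel() once creates data.xlsx in the working directory. With the file in place, the reading code is as follows.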

from pyspark.sql import SparkSession
import pyspark.pandas as ps

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Reading Excel File into Pandas-on-Spark DataFrame") \
    .getOrCreate()

# Specify the path to the Excel file
excel_file_path = "data.xlsx"

# Read Excel file into Pandas-on-Spark DataFrame
# (use pyspark.pandas.read_excel; plain pandas.read_excel returns an ordinary pandas DataFrame)
df_spark = ps.read_excel(excel_file_path, sheet_name="Sheet1")

# Show the contents of the DataFrame
# (a Pandas-on-Spark DataFrame has no show(), so convert to a Spark DataFrame first)
df_spark.to_spark().show()

# Stop SparkSession
spark.stop()
Output

Upon executing the code, the contents of the Excel file data.xlsx are printed as a table.

+-------+---+------+
|   Name|Age|Gender|
+-------+---+------+
| Sachin| 30|Female|
|    Ram| 35|  Male|
|Sreerag| 40|  Male|
| Dravid| 45|  Male|
+-------+---+------+
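
The article mentions Series as well as DataFrames. One way to obtain a Pandas-on-Spark Series is to select a single column from the DataFrame that read_excel returns. The sketch below assumes the same data.xlsx file and Sheet1 sheet as above; read_name_column is a hypothetical helper name, not part of the library.

```python
def read_name_column(path="data.xlsx", sheet="Sheet1", column="Name"):
    """Read one Excel sheet with pandas-on-Spark and return one column as a Series."""
    # Imported inside the function so that defining the helper
    # does not itself require a working Spark installation.
    import pyspark.pandas as ps

    psdf = ps.read_excel(path, sheet_name=sheet)  # pandas-on-Spark DataFrame
    return psdf[column]                           # pandas-on-Spark Series
```

Calling read_name_column() returns the Name column as a Series, and calling to_pandas() on the result collects it into an ordinary pandas Series when the data is small enough to fit on the driver.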
Author: user