The Pandas API on Spark enables this fusion of pandas' familiar interface with Spark's scalability, letting users read Excel files into Pandas-on-Spark DataFrames or Series with minimal effort. In this article, we'll dive into the read_excel function's usage, complete with examples and outputs.
Understanding read_excel
The read_excel function in the Pandas API on Spark allows users to read Excel files into Pandas-on-Spark DataFrames or Series, providing a seamless solution for handling tabular data stored in Excel format. This functionality opens up new avenues for data processing, enabling users to leverage Spark's distributed computing capabilities while retaining the familiar interface of Pandas. Let's explore its usage with examples.
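As a rough sketch of the call pattern, read_excel mirrors the familiar pandas keyword arguments such as sheet_name, header, and usecols. The file name sales.xlsx, the sheet name Q1, and the column range below are illustrative placeholders, and an Excel engine such as openpyxl is assumed to be installed:
import pyspark.pandas as ps

# Read a single sheet; sheet_name, header, and usecols behave as in pandas.
# "sales.xlsx", "Q1", and the column range are placeholder values.
psdf = ps.read_excel(
    "sales.xlsx",        # path to the Excel file (local or distributed storage)
    sheet_name="Q1",     # sheet to read; an integer index also works
    header=0,            # row to use as the column names
    usecols="A:C",       # restrict reading to a subset of columns
)
print(psdf.head())       # Pandas-on-Spark DataFrames print like pandas ones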
Example Usage
Suppose we have an Excel file named data.xlsx containing some sample data in a sheet named Sheet1. We can read this Excel file into a Pandas-on-Spark DataFrame using read_excel.
from pyspark.sql import SparkSession
import pyspark.pandas as ps

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Reading Excel File into Pandas-on-Spark DataFrame") \
    .getOrCreate()

# Specify the path to the Excel file
excel_file_path = "data.xlsx"

# Read the Excel file into a Pandas-on-Spark DataFrame (requires an Excel engine such as openpyxl)
df_spark = ps.read_excel(excel_file_path, sheet_name="Sheet1")

# Show the contents by converting to a Spark DataFrame, which provides show()
df_spark.to_spark().show()

# Stop SparkSession
spark.stop()
Upon executing the code, the contents of the Excel file data.xlsx, now held in a Pandas-on-Spark DataFrame, are displayed as a Spark-style table.
+-------+---+------+
| Name|Age|Gender|
+-------+---+------+
| Sachin| 30|Female|
| Ram| 35| Male|
|Sreerag| 40| Male|
| Dravid| 45| Male|
+-------+---+------+
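As a further sketch, passing a list for sheet_name returns a dictionary of Pandas-on-Spark DataFrames keyed by sheet name, mirroring pandas' behavior; the second sheet name below is an assumption, not part of the example file above:
import pyspark.pandas as ps

# Read several sheets at once; "Sheet2" is an assumed sheet name for illustration.
sheets = ps.read_excel("data.xlsx", sheet_name=["Sheet1", "Sheet2"])

# The result is a dict mapping sheet name -> Pandas-on-Spark DataFrame.
for name, psdf in sheets.items():
    print(name, psdf.shape)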