PySpark: Creating data series with customizable parameters


In the Pandas API on Spark, Series() creates a one-dimensional labeled data series modeled on its Pandas counterpart. Let’s look at its parameters and work through practical examples to see how it is used.

Understanding Series()

The Series() function in the Pandas API on Spark (the pyspark.pandas module) creates a data series that behaves much like a Pandas Series, while the data itself is held by Spark. Its parameters let you set the index, data type, and name of the series, so the same Pandas-style code can be used to manipulate and analyze the data.

Syntax

Series([data, index, dtype, name, copy, ...])
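  • data (optional): The values used to initialize the series, for example a list, a dict, a scalar, or a Pandas Series.
  • index (optional): The labels for the values. If omitted, a default integer index is used.
  • dtype (optional): The data type of the values, such as 'int64' or 'float64'. If omitted, it is inferred from the data.
  • name (optional): The name of the series.
  • copy (optional): Whether to copy the input data.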
  • Additional parameters for customizing the series.

Practical Examples

Let’s walk through a few examples that create a Pandas-on-Spark Series and customize it.

Example 1: Creating a Series

from pyspark.sql import SparkSession
import pandas as pd
import pyspark.pandas as ps
# Initialize Spark Session (the Pandas API on Spark reuses the active session)
spark = SparkSession.builder \
    .appName("series_example @ Freshers.in Learning") \
    .getOrCreate()
# Create a Pandas Series
data = [10, 20, 30, 40, 50]
index = ['A', 'B', 'C', 'D', 'E']
series = pd.Series(data, index=index)
# Convert to Pandas-on-Spark Series
psser = ps.from_pandas(series)
# Show Series, sorted by index because distributed row order is not guaranteed
print(psser.sort_index())

Output:

A    10
B    20
C    30
D    40
E    50
dtype: int64

Example 2: Customizing Series

import pyspark.pandas as ps
# Create a Pandas-on-Spark Series with custom dtype and name
data = [10.5, 20.3, 30.7, 40.2, 50.9]
index = ['A', 'B', 'C', 'D', 'E']
dtype = 'float64'
name = 'MySeries'
psser = ps.Series(data, index=index, dtype=dtype, name=name)
# Show Series, sorted by index because distributed row order is not guaranteed
print(psser.sort_index())

Output:

A    10.5
B    20.3
C    30.7
D    40.2
E    50.9
Name: MySeries, dtype: float64

Example 3: Copying Series

import pandas as pd
import pyspark.pandas as ps
# Create a Pandas Series and copy it
data = {'A': 10, 'B': 20, 'C': 30, 'D': 40, 'E': 50}
series = pd.Series(data)
copy_series = series.copy()
# Convert the copy to a Pandas-on-Spark Series
psser = ps.from_pandas(copy_series)
# Show Series, sorted by index because distributed row order is not guaranteed
print(psser.sort_index())

Output:

A    10
B    20
C    30
D    40
E    50
dtype: int64
Author: user