Adding a new column to a DataFrame with a constant value

user November 21, 2023

The lit function in PySpark adds a constant value as a new column in a DataFrame. It is simple to use and applies to a wide range of data manipulation tasks. This article explains the lit function, its advantages, and a practical application.

Understanding lit in PySpark

The lit function creates a Column holding a literal (constant) value, which can then be passed to DataFrame methods such as withColumn or select. This is particularly useful when you need the same fixed value in every row of a DataFrame. It is imported from pyspark.sql.functions:

from pyspark.sql.functions import lit

Advantages of using lit

Flexibility: Accepts constants of different types (strings, numbers, booleans, None), and the resulting column can be combined with other column expressions.
Simplicity: Easy to use for creating new columns with fixed values.
Data Enrichment: Useful for appending static data to dynamic datasets.

Use case: Adding a constant identifier to a name list

Let’s consider a scenario where we have a dataset containing names: Sachin, Ram, Raju, David, and Wilson. Suppose we want to add a new column that identifies each name as belonging to a particular group.

Dataset

Name
Sachin
Ram
Raju
David
Wilson

Objective

Add a new column, Group, with a constant value ‘GroupA’ for all rows.

Implementation in PySpark

Setting up the PySpark environment and creating the DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
# Initialize Spark Session
spark = SparkSession.builder.appName("Lit Example").getOrCreate()
# Sample Data
data = [("Sachin",), ("Ram",), ("Raju",), ("David",), ("Wilson",)]
# Creating DataFrame
df = spark.createDataFrame(data, ["Name"])
df.show()

Applying the lit function:

Output

The DataFrame now includes a new column, Group, with the constant value ‘GroupA’:

Name	Group
Sachin	GroupA
Ram	GroupA
Raju	GroupA
David	GroupA
Wilson	GroupA


Author: user