Adding a specified character to the left of a string until it reaches a certain length in PySpark

LPAD, or left padding, is a PySpark string function that adds a specified character (or string) to the left of a value until it reaches a given length. It is useful for keeping data consistent and readable wherever uniform string lengths matter. This article covers the lpad function, its advantages, and a practical example with sample data.

The syntax of the lpad function is:

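lpad(col, len, pad)
col: The column (or column name) to be padded.
len: The total length of the string after padding. A value longer than len is shortened to len characters.
pad: The string used for padding, repeated as needed to fill the remaining width.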
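To make the padding rule concrete, here is a minimal plain-Python sketch of how lpad is understood to treat a single value. The helper name lpad_reference is hypothetical and is not part of PySpark; it only models the behavior for illustration, including the detail that a value longer than the target length is cut down to that length.

```python
def lpad_reference(value: str, length: int, pad: str) -> str:
    """Illustrative model of left padding for one string (not a PySpark API)."""
    missing = length - len(value)
    # Nothing to pad (or no pad string): keep at most `length` characters.
    if missing <= 0 or not pad:
        return value[:max(length, 0)]
    # Repeat the pad string and trim it to exactly the missing width.
    fill = (pad * missing)[:missing]
    return fill + value


print(lpad_reference("Ram", 10, "_"))    # 3 characters, so 7 underscores are added
print(lpad_reference("Sachin", 4, "_"))  # longer than 4, so it is shortened
print(lpad_reference("abc", 8, "xy"))    # a multi-character pad is repeated and trimmed
```

In PySpark itself the same rule is applied to every row of a column, so no Python-level loop is needed.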

Advantages of LPAD

Consistency: Ensures uniform length of strings, aiding in consistent data processing.

Alignment: Improves readability, especially in tabular data formats.

Data Integrity: Helps meet fixed-length requirements, for example when a downstream system expects every value in a field to have the same width.

Example: Formatting names for standardized reporting

Consider a dataset with the names of individuals: Sachin, Ram, Raju, David, and Wilson. These names vary in length, but for a report, we need them to be of uniform length for better alignment and readability.

Example Dataset

Name
Sachin
Ram
Raju
David
Wilson

Objective

Standardize the length of all names to 10 characters by padding with underscores (_).

Implementation in PySpark

First, let’s set up the PySpark environment and create our initial DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lpad

# Initialize Spark Session
spark = SparkSession.builder.appName("LPAD Example").getOrCreate()

# Sample Data
data = [("Sachin",), ("Ram",), ("Raju",), ("David",), ("Wilson",)]

# Creating DataFrame
df = spark.createDataFrame(data, ["Name"])
df.show()

Apply the lpad function:

Output

The result is a DataFrame where all names are consistently 10 characters long, padded with underscores:

Name    PaddedName
Sachin  ____Sachin
Ram     _______Ram
Raju    ______Raju
David   _____David
Wilson  ____Wilson
Author: user