In this comprehensive guide, we’ll dive into two essential PySpark integer data types: LongType and ShortType. You’ll discover their applications, use cases, and how to leverage them effectively in your PySpark projects.
Understanding Integer Data Types
Integer data types in PySpark are crucial for handling numerical values such as counts, IDs, and timestamps. PySpark offers several fixed-width signed integer types (ByteType, ShortType, IntegerType, and LongType), and choosing the right one lets you balance value range against storage and memory use.
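For reference, here is a minimal sketch (the variable names are illustrative) that lists PySpark’s fixed-width signed integer types alongside their storage size and value range:
from pyspark.sql.types import ByteType, ShortType, IntegerType, LongType
# Fixed-width signed integer types in pyspark.sql.types,
# with their storage size in bytes and value range.
integer_types = [
    (ByteType(),    1, -2**7,  2**7 - 1),    # -128 to 127
    (ShortType(),   2, -2**15, 2**15 - 1),   # -32,768 to 32,767
    (IntegerType(), 4, -2**31, 2**31 - 1),   # about +/- 2.1 billion
    (LongType(),    8, -2**63, 2**63 - 1),   # about +/- 9.2 quintillion
]
for dtype, size_bytes, lo, hi in integer_types:
    print(f"{dtype.simpleString():>8}: {size_bytes} byte(s), range [{lo}, {hi}]")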
1. LongType: Handling Large Integer Values
The LongType data type in PySpark is a 64-bit signed integer, able to represent values from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. It is commonly used for numeric IDs, Unix timestamps, or any integer data that may grow beyond the 32-bit range.
Example: Storing Timestamps
Let’s consider a scenario where you want to store Unix epoch timestamps (in seconds) for events in a PySpark dataframe:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType
# Initialize SparkSession
spark = SparkSession.builder.appName("LongType at Freshers.in Learning Example").getOrCreate()
# Create a sample dataframe
data = [("Event 1", 1643145600), # January 26, 2022, 00:00:00
("Event 2", 1674681600), # February 24, 2023, 00:00:00
("Event 3", 1706217600)] # March 26, 2024, 00:00:00
schema = StructType([StructField("EventName", StringType(), True),
StructField("Timestamp", LongType(), True)])
df = spark.createDataFrame(data, schema)
# Show the dataframe
df.show()
+---------+----------+
|EventName| Timestamp|
+---------+----------+
| Event 1|1643145600|
| Event 2|1674681600|
| Event 3|1706217600|
+---------+----------+
In this example, we use LongType to store Unix epoch timestamps (in seconds) for each event.
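Since the Timestamp column holds raw epoch seconds, you may also want a human-readable timestamp. Below is a minimal sketch of one way to do that with from_unixtime, assuming the df created above (the ReadableTime column name is just an illustration, and the rendered time depends on the session time zone):
from pyspark.sql import functions as F
# from_unixtime() interprets the LongType value as seconds since the Unix epoch
# and returns a string; casting it yields a proper TimestampType column.
df_with_ts = df.withColumn("ReadableTime",
                           F.from_unixtime(F.col("Timestamp")).cast("timestamp"))
df_with_ts.printSchema()
df_with_ts.show(truncate=False)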
2. ShortType: Handling Small Integer Values
The ShortType data type in PySpark is a 16-bit signed integer, covering values from -32,768 to 32,767. It is particularly useful when you want to reduce storage and memory usage for integer data that doesn’t need the range of IntegerType or LongType.
Example: Storing Product IDs
Suppose you want to store product IDs in a PySpark dataframe:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ShortType
# Initialize SparkSession
spark = SparkSession.builder.appName("ShortType @ Freshers.in Learning Example").getOrCreate()
# Create a sample dataframe
data = [("Product 1", 101),
("Product 2", 202),
("Product 3", 303),
("Product 4", 404),
("Product 5", 505)]
schema = StructType([StructField("ProductName", StringType(), True),
StructField("ProductID", ShortType(), True)])
df = spark.createDataFrame(data, schema)
# Show the dataframe
df.show()
+-----------+---------+
|ProductName|ProductID|
+-----------+---------+
| Product 1| 101|
| Product 2| 202|
| Product 3| 303|
| Product 4| 404|
| Product 5| 505|
+-----------+---------+
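Because ShortType is only 16 bits wide, make sure every value fits in -32,768 to 32,767 before choosing it. The sketch below (reusing the df from this example; column names such as ProductID_long are illustrative) shows how you might widen or narrow an integer column with cast():
from pyspark.sql import functions as F
from pyspark.sql.types import ShortType, LongType
# Widening is always safe: ShortType -> LongType loses no information.
df_wide = df.withColumn("ProductID_long", F.col("ProductID").cast(LongType()))
# Narrowing only makes sense when every value fits in ShortType's range,
# so check the min and max before casting down.
df_wide.selectExpr(
    "min(ProductID_long) >= -32768 AND max(ProductID_long) <= 32767 AS fits_short"
).show()
df_narrow = df_wide.withColumn("ProductID_short", F.col("ProductID_long").cast(ShortType()))
df_narrow.printSchema()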