Removing leading spaces (spaces on the left side) from a string in PySpark

PySpark provides several functions for string manipulation, one of which is ltrim. This article covers the ltrim function, its advantages, and its application in a real-world context.

The ltrim function removes leading spaces (spaces on the left side) from the values of a string column and leaves trailing spaces untouched, which makes it useful for cleaning and standardizing data. It is imported from pyspark.sql.functions and takes the column to trim as its argument:

from pyspark.sql.functions import ltrim
ltrim(col)

Advantages of using ltrim

  • Data Cleansing: It’s essential for removing unwanted leading whitespace, which can occur during data collection or transformation.
  • Data Standardization: Ensures consistency in string data, facilitating more accurate analysis and comparison.
  • Improved Readability: Enhances the readability of data, especially when visualizing or presenting it.
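
The standardization point can be illustrated without Spark at all. The sketch below is a plain-Python analogy, not PySpark code: it uses str.lstrip(" ") to mimic what ltrim does to each value, and the sample names are hypothetical.

```python
# Three spellings of the same name, differing only in leading spaces.
names = ["Raju", " Raju", "   Raju"]

# Without trimming, a set (or a group-by, or a join key) sees three distinct values.
distinct_raw = set(names)

# Stripping leading spaces collapses them into a single value.
distinct_clean = {name.lstrip(" ") for name in names}

print(len(distinct_raw), len(distinct_clean))
```

The same effect is what makes grouping, joining, and deduplication on an untrimmed column give misleading results.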

Use case: Standardizing name entries

Consider a dataset of names with inconsistent leading spaces: “   Sachin”, “                  Ram”, “ Raju”, “David”, “ Wilson”. Some entries have several leading spaces, some have one, and one has none.

Objective

Remove leading spaces from each name to standardize the dataset.

Implementation in PySpark

Setting up the PySpark environment and creating the DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import ltrim

# Initialize the Spark session
spark = SparkSession.builder.appName("Ltrim Example").getOrCreate()

# Sample data: names with varying numbers of leading spaces
data = [("   Sachin",), ("                  Ram",), (" Raju",), ("David",), (" Wilson",)]

# Create the DataFrame and display it without truncating values
df = spark.createDataFrame(data, ["Name"])
df.show(20, truncate=False)

Output

+---------------------+
|Name                 |
+---------------------+
|   Sachin            |
|                  Ram|
| Raju                |
|David                |
| Wilson              |
+---------------------+

Applying the ltrim function:

# Add a TrimmedName column with the leading spaces removed
trimmed_df = df.withColumn("TrimmedName", ltrim(df.Name))
trimmed_df.show(20, truncate=False)

Output

+---------------------+-----------+
|Name                 |TrimmedName|
+---------------------+-----------+
|   Sachin            |Sachin     |
|                  Ram|Ram        |
| Raju                |Raju       |
|David                |David      |
| Wilson              |Wilson     |
+---------------------+-----------+

Author: user