The regexp_extract function in PySpark extracts the part of a string that matches a given regular expression pattern. It is useful whenever a string column needs to be parsed or split into smaller components. This article explains how the function works and walks through a practical example.
Syntax:
regexp_extract(str, pattern, idx)
str: The string column to be searched.
pattern: The regular expression pattern defining the part of the string to extract.
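idx: The index of the capture group to extract. Index 0 returns the entire match; capture groups are numbered from 1. If the pattern or the requested group does not match, an empty string is returned.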
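The role of idx can be illustrated without a Spark session. The sketch below uses Python's built-in re module and a hypothetical helper, extract_like_spark, that mimics the documented behavior of regexp_extract on a single string: index 0 yields the whole match, index 1 and up yield capture groups, and a failed match yields an empty string. Spark evaluates patterns with Java's regex engine, so treat this as an approximation that should hold for simple patterns like these, not as Spark's implementation.

```python
import re


def extract_like_spark(text, pattern, idx):
    """Hypothetical stand-in for regexp_extract applied to one Python string."""
    # re.ASCII keeps \w limited to [A-Za-z0-9_], an assumption made here to
    # approximate the default behavior of Java's regex engine.
    match = re.search(pattern, text, flags=re.ASCII)
    if match is None:
        return ""  # no match: regexp_extract returns an empty string
    group = match.group(idx)
    return group if group is not None else ""  # unmatched group: also ""


name = "Sachin Tendulkar"
two_words = r"(\w+) (\w+)$"

whole = extract_like_spark(name, two_words, 0)   # entire match
first = extract_like_spark(name, two_words, 1)   # first capture group
last = extract_like_spark(name, two_words, 2)    # second capture group
missing = extract_like_spark(name, r"(\d+)", 1)  # pattern does not match
```

Here idx=0 gives the full matched text, idx=1 and idx=2 give the individual groups, and a pattern that finds nothing gives an empty string.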
Example: Extracting Information from Names
Let’s consider an example where we use regexp_extract to extract specific parts from a list of names.
Dataset Example:
Full_Name |
---|
Sachin Tendulkar |
Ram Nath Kovind |
Raju Srivastava |
David Beckham |
Wilson Raynor |
Suppose we want to extract the last names from these full names.
Step-by-Step Implementation:
Initializing the PySpark Environment: Start by setting up your PySpark session and importing the necessary function.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract
spark = SparkSession.builder.appName("regexp_extract_example").getOrCreate()
Creating the DataFrame:
Create a DataFrame with the given names.
data = [("Sachin Tendulkar",), ("Ram Nath Kovind",), ("Raju Srivastava",), ("David Beckham",), ("Wilson Raynor",)]
df = spark.createDataFrame(data, ["Full_Name"])
df.show()
Applying regexp_extract:
We will use a regular expression to extract the last name from each full name.
extracted_df = df.withColumn("Last_Name", regexp_extract("Full_Name", r"(\w+)$", 1))
extracted_df.show()
The regular expression (\w+)$ captures the last run of word characters at the end of the string, which in our case is the last name.
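It is worth checking what this pattern captures for inputs that are less tidy than the sample data. The sketch below tries a few names, some of them hypothetical and not part of the dataset, using Python's re module with re.ASCII as an approximation of how Java's regex engine treats \w by default. The helper last_word is illustrative only and returns an empty string on no match, as regexp_extract does.

```python
import re

# Same pattern as in the article; re.ASCII is an assumption used to
# approximate the default ASCII-only \w of Java's regex engine.
LAST_WORD = re.compile(r"(\w+)$", flags=re.ASCII)


def last_word(name):
    """Return the word characters that end the string, or "" if none do."""
    match = LAST_WORD.search(name)
    return match.group(1) if match else ""


samples = [
    "Ram Nath Kovind",    # from the dataset: three words
    "Mary Smith-Jones",   # hypothetical: hyphenated last name
    "David Beckham ",     # hypothetical: trailing space
    "Cher",               # hypothetical: single-word name
]

results = {name: last_word(name) for name in samples}
```

With these inputs, the three-word name yields Kovind, the hyphenated name yields only Jones because a hyphen is not a word character, the single-word name yields itself, and the name with a trailing space yields an empty string because no word character sits directly before the end of the string. If trailing whitespace is possible in real data, trimming the column first (for example with pyspark.sql.functions.trim) or using a pattern such as (\w+)\s*$ is one way to avoid the empty result.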
Output:
Full_Name | Last_Name |
---|---|
Sachin Tendulkar | Tendulkar |
Ram Nath Kovind | Kovind |
Raju Srivastava | Srivastava |
David Beckham | Beckham |
Wilson Raynor | Raynor |
The regexp_extract function in PySpark extracts a chosen capture group from each value in a string column. As the example shows, a single pattern and group index are enough to pull a structured field, such as a last name, out of free-form text.