PySpark: Format phone numbers in a specific way

PySpark @ Freshers.in

In this article, we’ll be working with a PySpark DataFrame that contains a column of phone numbers. We’ll use PySpark’s string functions regexp_replace(), substring(), and concat() to strip the non-digit characters and rebuild each number as “(XXX) XXX-XXXX”, producing a new DataFrame with the formatted phone numbers.

Sample Data

To demonstrate how to format phone numbers in PySpark, we’ll create a sample DataFrame with some phone numbers. Here’s the code to create the sample data:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("FormatPhoneNumbers").getOrCreate()
data = [("John", "123-456-7890"),
        ("Jane", "234-567-8901"),
        ("Bob", "345-678-9012")]
df = spark.createDataFrame(data, ["name", "phone_number"])
df.show()
Output
+----+------------+
|name|phone_number|
+----+------------+
|John|123-456-7890|
|Jane|234-567-8901|
| Bob|345-678-9012|
+----+------------+

The sample data consists of a DataFrame with two columns: “name” and “phone_number”. The phone numbers are in the format “XXX-XXX-XXXX”.
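
Before reformatting, it can help to confirm that raw values really follow this pattern. The following is a minimal plain-Python sketch, not part of the original walkthrough, that checks a string against “XXX-XXX-XXXX” without needing a Spark session. The helper name is_dashed_phone is hypothetical.

```python
import re

# Pattern for "XXX-XXX-XXXX": three digits, a dash, three digits, a dash, four digits
DASHED_PHONE = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")

def is_dashed_phone(value: str) -> bool:
    """Return True if the whole string matches XXX-XXX-XXXX (hypothetical helper)."""
    return DASHED_PHONE.fullmatch(value) is not None

samples = ["123-456-7890", "234-567-8901", "345-678-9012"]
print([is_dashed_phone(s) for s in samples])
```

Using fullmatch() rather than search() means that values with extra leading or trailing characters are rejected.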

Formatting Phone Numbers

Now that we have our sample data, we can start formatting the phone numbers. Here’s the code to remove any non-numeric characters from the phone numbers:

from pyspark.sql.functions import regexp_replace
df = df.withColumn("phone_number", regexp_replace("phone_number", "[^0-9]", ""))
df.show()

Output

+----+------------+
|name|phone_number|
+----+------------+
|John|  1234567890|
|Jane|  2345678901|
| Bob|  3456789012|
+----+------------+

This code uses PySpark’s regexp_replace() function with the pattern [^0-9], which matches any character that is not a digit, and replaces each match with an empty string. The phone numbers now contain only digits.
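
To see what the [^0-9] pattern does on a few differently punctuated inputs, here is a small plain-Python sketch that applies the same pattern with re.sub(). It only illustrates the regular expression and is not the Spark code itself. The helper name strip_non_digits and the extra sample inputs are assumptions made for illustration.

```python
import re

def strip_non_digits(value: str) -> str:
    """Remove every character that is not 0-9 (hypothetical helper)."""
    return re.sub(r"[^0-9]", "", value)

# Inputs with different separators all reduce to the same kind of digit string
for raw in ["123-456-7890", "(234) 567-8901", "345.678.9012"]:
    print(raw, "->", strip_non_digits(raw))
```

Because every non-digit is dropped, this cleanup step works regardless of which separators the raw values use.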

Next, we’ll format the phone numbers in the desired way. Let’s say we want to format the phone numbers as “(XXX) XXX-XXXX”. Here’s the code to do that:

from pyspark.sql.functions import regexp_replace, col, concat, lit, substring

# Strip non-digit characters (has no effect if the previous step was already applied)
df = df.withColumn("phone_number", regexp_replace(col("phone_number"), "[^0-9]", ""))
# Rebuild the number as "(XXX) XXX-XXXX"; substring() positions are 1-based
df = df.withColumn("phone_number",
                   concat(lit("("), substring(col("phone_number"), 1, 3), lit(") "),
                          substring(col("phone_number"), 4, 3), lit("-"),
                          substring(col("phone_number"), 7, 4)))
df.show()
Output
+----+--------------+
|name|  phone_number|
+----+--------------+
|John|(123) 456-7890|
|Jane|(234) 567-8901|
| Bob|(345) 678-9012|
+----+--------------+

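This code uses PySpark’s substring() function, which takes a 1-based start position and a length, to extract the first three digits (position 1), the next three digits (position 4), and the final four digits (position 7) of each phone number. concat() then joins these pieces with the literal “(”, “) ”, and “-” strings supplied through lit().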
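
The position arithmetic can be checked locally with a plain-Python sketch that mimics a 1-based (position, length) substring using Python’s 0-based slicing. The function names substring and format_phone below are hypothetical stand-ins, and the 10-digit length check is an added assumption: the Spark code in this article does not validate the length of its input.

```python
def substring(value: str, pos: int, length: int) -> str:
    """Mimic a 1-based (pos, length) substring with Python's 0-based slicing."""
    start = pos - 1
    return value[start:start + length]

def format_phone(digits: str) -> str:
    """Format a 10-digit string as (XXX) XXX-XXXX (hypothetical helper)."""
    # Assumption for this sketch: reject anything that is not exactly 10 ASCII digits
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected exactly 10 digits, got {digits!r}")
    area = substring(digits, 1, 3)      # positions 1-3
    exchange = substring(digits, 4, 3)  # positions 4-6
    line = substring(digits, 7, 4)      # positions 7-10
    return f"({area}) {exchange}-{line}"

print(format_phone("1234567890"))  # (123) 456-7890
```

A check like this matters because slicing a shorter string would not fail. It would simply return fewer characters and produce a malformed number.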

We learned how to use PySpark to format phone numbers in a specific way. We used regexp_replace() to remove non-numeric characters from the phone numbers, and substring() together with concat() and lit() to rebuild them in the “(XXX) XXX-XXXX” format.

Author: user
