PySpark provides an efficient way to handle big data processing and manipulation tasks. This article demonstrates how to convert a delimiter-separated string column to an array column in PySpark, using a sample DataFrame whose names carry the prefix freshers_in. The conversion uses the split function from the pyspark.sql.functions module, and the same approach applies to similar data transformation needs in PySpark.
Initializing a SparkSession
from pyspark.sql import SparkSession
# Reuse an active session if one exists; otherwise create a new one.
spark = SparkSession.builder \
    .appName("String to array column @Freshers.in Training Example") \
    .getOrCreate()
Creating sample data
Let’s create a sample DataFrame freshers_in_df to demonstrate the conversion of delimiter-separated strings to array columns.
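from pyspark.sql import Row
# Each Row holds a name and a comma-separated string of subjects.
data = [
    Row(freshers_in_name='Dhoni Mahinder', freshers_in_subjects='Math,Physics,Chemistry'),
    Row(freshers_in_name='Dhoni Mahinder', freshers_in_subjects='Biology,Physics,Chemistry'),
]
freshers_in_df = spark.createDataFrame(data)
# truncate=False prints full cell values instead of cutting them at 20 characters.
freshers_in_df.show(truncate=False)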
Output
+----------------+-------------------------+
|freshers_in_name|freshers_in_subjects     |
+----------------+-------------------------+
|Dhoni Mahinder  |Math,Physics,Chemistry   |
|Dhoni Mahinder  |Biology,Physics,Chemistry|
+----------------+-------------------------+
Conversion of delimiter-separated string to array column
Let’s say the freshers_in_subjects column in our DataFrame freshers_in_df contains strings with subjects separated by commas. We want to convert this column into an array column, where each element of the array is a subject.
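To make the target shape concrete, here is a plain-Python sketch of what the conversion does to a single value, with no Spark involved. The helper name to_subject_list is hypothetical and exists only for illustration; in Spark the same idea is applied to the whole column at once.

```python
from typing import List


def to_subject_list(subjects: str, delimiter: str = ",") -> List[str]:
    """Split one delimiter-separated string into a list of subjects."""
    return subjects.split(delimiter)


# The two sample values from the freshers_in_subjects column.
rows = ["Math,Physics,Chemistry", "Biology,Physics,Chemistry"]
converted = [to_subject_list(value) for value in rows]
print(converted)
```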
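In PySpark, you can achieve this using the split function from the pyspark.sql.functions module. The split function takes the column to be split (a column name or a Column) and a pattern to split on. The pattern is interpreted as a regular expression, so a plain comma works as-is, while delimiters that are regex metacharacters, such as | or ., must be escaped. An optional third argument, limit (available in Spark 3.0 and later), controls how many times the pattern is applied.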
from pyspark.sql.functions import split
freshers_in_df = freshers_in_df.withColumn("freshers_in_subjects", split("freshers_in_subjects", ","))
freshers_in_df.show(truncate=False)
Output:
+----------------+-----------------------------+
|freshers_in_name|freshers_in_subjects         |
+----------------+-----------------------------+
|Dhoni Mahinder  |[Math, Physics, Chemistry]   |
|Dhoni Mahinder  |[Biology, Physics, Chemistry]|
+----------------+-----------------------------+
Now, the freshers_in_subjects column has been successfully converted from a delimiter-separated string to an array column.