Counting the number of occurrences of each value in a string column with multiple values separated by a delimiter is a common task in data preprocessing and cleaning. PySpark provides a simple and efficient way to split the string column into multiple columns based on the delimiter and count the number of occurrences of each value using its built-in functions.
Input Data
Let’s assume we have the following dataset that contains a string column with multiple values separated by a comma:
+----+------------+
| ID | Items |
+----+------------+
| 1 | A,B,C,D,E |
| 2 | A,C,F,G,H |
| 3 | B,C,D,G,H |
| 4 | A,C,D,E,F |
+----+------------+
Counting the Number of Occurrences of Each Value in a String Column in PySpark
To count the number of occurrences of each value in a string column with multiple values separated by a delimiter in PySpark, we can use the split and explode functions. The split function splits the string column into an array of strings based on the delimiter, and the explode function creates a new row for each element in the array. We can then group the rows by the exploded column and count the number of occurrences of each value using the count function.
For example, to count the number of occurrences of each value in the Items column in the input DataFrame, we can use the following code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, count
# create a SparkSession
spark = SparkSession.builder.appName("CountOccurrences").getOrCreate()
# load the input data into a DataFrame
df = spark.createDataFrame([
(1, "A,B,C,D,E"),
(2, "A,C,F,G,H"),
(3, "B,C,D,G,H"),
(4, "A,C,D,E,F")
], ["ID", "Items"])
# split the Items column into an array of strings and explode the array into multiple rows
df_exploded = df.select("ID", explode(split("Items", ","))).alias("Item")
# group the rows by the Item column and count the number of occurrences of each value
df_count = df_exploded.groupBy("col").agg(count("ID").alias("Count")).orderBy("Count", ascending=False)
# show the result
df_count.show()
+---+-----+
|col|Count|
+---+-----+
| C| 4|
| D| 3|
| A| 3|
| F| 2|
| E| 2|
| B| 2|
| H| 2|
| G| 2|
+---+-----+
Counting the number of occurrences of each value in a string column with multiple values separated by a delimiter in PySpark is a simple and efficient process using the split, explode, and count functions. By splitting the string column into an array of strings and exploding the array into multiple rows, we can easily count the number of occurrences of each value in the column.
Spark important urls to refer