PySpark : Dataset has a column that contains a string with multiple values separated by a delimiter. Count the number of occurrences of each value using PySpark.

PySpark @ Freshers.in

Counting how many times each value occurs in a string column whose entries hold multiple delimiter-separated values is a common task in data preprocessing and cleaning. PySpark handles it efficiently with built-in functions: split the string on the delimiter, explode the resulting array into rows, and count the occurrences of each value.

Input Data

Let’s assume we have the following dataset that contains a string column with multiple values separated by a comma:

+----+------------+
| ID |    Items   |
+----+------------+
|  1 | A,B,C,D,E  |
|  2 | A,C,F,G,H  |
|  3 | B,C,D,G,H  |
|  4 | A,C,D,E,F  |
+----+------------+

Counting the Number of Occurrences of Each Value in a String Column in PySpark

To count the number of occurrences of each value in a string column with multiple values separated by a delimiter in PySpark, we can use the split and explode functions. The split function splits the string column into an array of strings based on the delimiter, and the explode function creates a new row for each element in the array. We can then group the rows by the exploded column and count the number of occurrences of each value using the count function.

For example, to count the number of occurrences of each value in the Items column in the input DataFrame, we can use the following code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, count

# create a SparkSession
spark = SparkSession.builder.appName("CountOccurrences").getOrCreate()

# load the input data into a DataFrame
df = spark.createDataFrame([
    (1, "A,B,C,D,E"),
    (2, "A,C,F,G,H"),
    (3, "B,C,D,G,H"),
    (4, "A,C,D,E,F")
], ["ID", "Items"])

# split the Items column on "," and explode the array into one row per item
df_exploded = df.select("ID", explode(split("Items", ",")).alias("Item"))

# group the rows by the Item column and count the occurrences of each value
df_count = df_exploded.groupBy("Item").agg(count("ID").alias("Count")).orderBy("Count", ascending=False)

# show the result
df_count.show()
Output
+----+-----+
|Item|Count|
+----+-----+
|   C|    4|
|   D|    3|
|   A|    3|
|   F|    2|
|   E|    2|
|   B|    2|
|   H|    2|
|   G|    2|
+----+-----+

Counting the number of occurrences of each value in a string column with multiple values separated by a delimiter in PySpark is a simple and efficient process using the split, explode, and count functions. By splitting the string column into an array of strings and exploding the array into multiple rows, we can easily count the number of occurrences of each value in the column.
