PySpark : Dataset has column that contains a string with multiple values separated by a delimiter.Count the number of occurrences of each value using PySpark.

user April 3, 2023 Leave a Comment

Counting the number of occurrences of each value in a string column with multiple values separated by a delimiter is a common task in data preprocessing and cleaning. PySpark provides a simple and efficient way to split the string column into multiple columns based on the delimiter and count the number of occurrences of each value using its built-in functions.

Input Data

Let’s assume we have the following dataset that contains a string column with multiple values separated by a comma:

+----+------------+
| ID |    Items   |
+----+------------+
|  1 | A,B,C,D,E  |
|  2 | A,C,F,G,H  |
|  3 | B,C,D,G,H  |
|  4 | A,C,D,E,F  |
+----+------------+

Counting the Number of Occurrences of Each Value in a String Column in PySpark

To count the number of occurrences of each value in a string column with multiple values separated by a delimiter in PySpark, we can use the split and explode functions. The split function splits the string column into an array of strings based on the delimiter, and the explode function creates a new row for each element in the array. We can then group the rows by the exploded column and count the number of occurrences of each value using the count function.

For example, to count the number of occurrences of each value in the Items column in the input DataFrame, we can use the following code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, count

# create a SparkSession
spark = SparkSession.builder.appName("CountOccurrences").getOrCreate()

# load the input data into a DataFrame
df = spark.createDataFrame([
    (1, "A,B,C,D,E"),
    (2, "A,C,F,G,H"),
    (3, "B,C,D,G,H"),
    (4, "A,C,D,E,F")
], ["ID", "Items"])

# split the Items column into an array of strings and explode the array into multiple rows
df_exploded = df.select("ID", explode(split("Items", ","))).alias("Item")

# group the rows by the Item column and count the number of occurrences of each value
df_count = df_exploded.groupBy("col").agg(count("ID").alias("Count")).orderBy("Count", ascending=False)

# show the result
df_count.show()

Output

+---+-----+
|col|Count|
+---+-----+
|  C|    4|
|  D|    3|
|  A|    3|
|  F|    2|
|  E|    2|
|  B|    2|
|  H|    2|
|  G|    2|
+---+-----+

Counting the number of occurrences of each value in a string column with multiple values separated by a delimiter in PySpark is a simple and efficient process using the split, explode, and count functions. By splitting the string column into an array of strings and exploding the array into multiple rows, we can easily count the number of occurrences of each value in the column.

Spark important urls to refer

Post Views: 382

Author: user

PySpark : Dataset has column that contains a string with multiple values separated by a delimiter.Count the number of occurrences of each value using PySpark.

Input Data

Counting the Number of Occurrences of Each Value in a String Column in PySpark

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget

How PARTITION BY Works in Snowflake, and SQL in general

Stash a specific file using Git

Prevent your computer from locking : Python to simulate mouse movements

AWS EC2 vs Azure Virtual Machines

Production and Industrial Engineering

Engineering Technical campus placement question and answers

JavaScript’s reduceRight() method to iterate over an array from right to left

Merging Multiple Images into a Single PDF File Using Python

Nanotechnology

Electronics and Instrumentation

Most Viewed Posts

Input Data

Counting the Number of Occurrences of Each Value in a String Column in PySpark

Related Posts

Related Articles

Leave a Reply Cancel reply

Trending

Recent Posts

Featured Posts – Slider Widget