Extracting Unique Values From Array Columns in PySpark

PySpark @ Freshers.in

When working with data in Spark, you may need to extract the distinct values from array columns. This can be challenging on large datasets, but PySpark's built-in array and DataFrame functions make the process much easier.

In this article, we’ll walk you through how to extract an array containing the distinct values from arrays in a column in PySpark. We will demonstrate this process using some sample data, which you can execute directly.

Let’s create a PySpark DataFrame to illustrate this process:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, collect_list

spark = SparkSession.builder.getOrCreate()

data = [("James", ["Java", "C++", "Python"]),
        ("Michael", ["Python", "Java", "C++", "Java"]),
        ("Robert", ["CSharp", "VB", "Python", "Java", "Python"])]

df = spark.createDataFrame(data, ["Name", "Languages"])
df.show(truncate=False)

Result

+-------+----------------------------------+
|Name   |Languages                         |
+-------+----------------------------------+
|James  |[Java, C++, Python]               |
|Michael|[Python, Java, C++, Java]         |
|Robert |[CSharp, VB, Python, Java, Python]|
+-------+----------------------------------+

Here, the column Languages is an array type column containing programming languages known by each person. As you can see, there are some duplicate values in each array. Now, let’s extract the distinct values from this array.

Using explode and dropDuplicates Functions
The first method uses the explode function to convert the array into individual rows, then dropDuplicates to remove duplicate (Name, Language) pairs:

df2 = df.withColumn("Languages", explode(df["Languages"]))\
    .dropDuplicates(["Name", "Languages"])

df2.show(truncate=False)

Result

+-------+---------+
|Name   |Languages|
+-------+---------+
|James  |Python   |
|James  |Java     |
|James  |C++      |
|Michael|Java     |
|Robert |Java     |
|Robert |CSharp   |
|Robert |Python   |
|Robert |VB       |
|Michael|C++      |
|Michael|Python   |
+-------+---------+

Here, the explode function creates a new row for each element in the given array or map column, and the dropDuplicates function eliminates duplicate rows.

However, the result is not an array but rather individual rows. To get an array of distinct values for each person, we can group the data by the ‘Name’ column and use the collect_list function:

df3 = df2.groupBy("Name").agg(collect_list("Languages").alias("DistinctLanguages"))
df3.show(truncate=False)

Result

+-------+--------------------------+
|Name   |DistinctLanguages         |
+-------+--------------------------+
|James  |[Python, Java, C++]       |
|Michael|[Java, C++, Python]       |
|Robert |[Java, CSharp, Python, VB]|
+-------+--------------------------+

If you want the list of all languages across all rows without duplicates, you can do the following:

df4 = df.select(explode(df["Languages"])).dropDuplicates(["col"])
df4.show(truncate=False)
+------+
|col   |
+------+
|C++   |
|Python|
|Java  |
|CSharp|
|VB    |
+------+
