How to remove duplicate values from an array in PySpark

user January 27, 2022

This post shows how to remove duplicates from a column that contains array elements. Consider the example below: one of the columns holds arrays, and some of those arrays contain duplicate values. We need to remove the duplicates from each array.

Input

['India','India','United States','Bahrain','India']
['Argentina','Bahrain','United States','Argentina','India']
['China','Bahrain','United States','Bahrain','Greenland']
['India','Bahrain','United States','Bahrain','Greenland']
['United States','Bahrain','United States','Bahrain']

PySpark provides the array_distinct function to handle this.

array_distinct(array)

Syntax: array_distinct(array)

The function will return an array of the same type as the input argument after removing all duplicate values.
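To make the expected result concrete, here is a plain-Python sketch (not Spark code) that mimics the effect of array_distinct on each row of the input above. The helper name distinct_keep_order is hypothetical, and keeping the first occurrence of each value in its original position is an assumption about output order rather than something stated here.

```python
# Plain-Python illustration only; `distinct_keep_order` is a hypothetical
# helper and is not part of PySpark.
def distinct_keep_order(values):
    """Return the values with duplicates removed, keeping first occurrences."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(values))


rows = [
    ['India', 'India', 'United States', 'Bahrain', 'India'],
    ['Argentina', 'Bahrain', 'United States', 'Argentina', 'India'],
    ['China', 'Bahrain', 'United States', 'Bahrain', 'Greenland'],
    ['India', 'Bahrain', 'United States', 'Bahrain', 'Greenland'],
    ['United States', 'Bahrain', 'United States', 'Bahrain'],
]

# Apply the helper to every row, the way array_distinct is applied per array.
deduplicated = [distinct_keep_order(row) for row in rows]
for row in deduplicated:
    print(row)
```

Under that assumption, the first row reduces to India, United States, Bahrain, and the last row to United States, Bahrain.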

Example Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_distinct

spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()
# Build a DataFrame with a single array column named 'country'.
df = spark.createDataFrame([
    (['India', 'India', 'United States', 'Bahrain', 'India'],),
    (['Argentina', 'Bahrain', 'United States', 'Argentina', 'India'],),
    (['China', 'Bahrain', 'United States', 'Bahrain', 'Greenland'],),
    (['India', 'Bahrain', 'United States', 'Bahrain', 'Greenland'],),
    (['United States', 'Bahrain', 'United States', 'Bahrain'],),
], ['country'])
df.show(20, False)
# array_distinct removes duplicate values from each array in the column.
df2 = df.select(array_distinct(df.country))
df2.show(20, False)

Note: We use Python 3.7.10 and Spark 3.0.1.