This post shows how to remove duplicates from a column that holds array elements. Consider the example below: one column contains arrays, and some of those arrays contain duplicate values. We need to remove the duplicates from each array.
Input
['India','India','United States','Bahrain','India']
['Argentina','Bahrain','United States','Argentina','India']
['China','Bahrain','United States','Bahrain','Greenland']
['India','Bahrain','United States','Bahrain','Greenland']
['United States','Bahrain','United States','Bahrain']
PySpark provides the array_distinct function to handle this.
array_distinct(array)
Syntax : array_distinct(array)
The function will return an array of the same type as the input argument after removing all duplicate values.
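To get a feel for the result before starting a Spark session, the same de-duplication can be mimicked in plain Python. This is only a rough sketch: `distinct_keep_order` is a hypothetical helper name, not part of PySpark, and it assumes that the first occurrence of each value keeps its position. That matches what Spark is commonly observed to do, but the documentation only promises that duplicates are removed. Null handling is not modeled here.

```python
def distinct_keep_order(values):
    """Plain-Python stand-in for array_distinct (hypothetical helper, not a Spark API)."""
    # dict keys are unique and remember insertion order (Python 3.7+),
    # so this drops repeats while keeping the first occurrence of each value.
    return list(dict.fromkeys(values))


# The same rows used as input in this post.
rows = [
    ['India', 'India', 'United States', 'Bahrain', 'India'],
    ['Argentina', 'Bahrain', 'United States', 'Argentina', 'India'],
    ['China', 'Bahrain', 'United States', 'Bahrain', 'Greenland'],
    ['India', 'Bahrain', 'United States', 'Bahrain', 'Greenland'],
    ['United States', 'Bahrain', 'United States', 'Bahrain'],
]

expected = [distinct_keep_order(row) for row in rows]
for row in expected:
    print(row)
```

For the first row this gives ['India', 'United States', 'Bahrain'], and for the last row ['United States', 'Bahrain'].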
Example Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_distinct

spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()

df = spark.createDataFrame(
    [
        (['India', 'India', 'United States', 'Bahrain', 'India'],),
        (['Argentina', 'Bahrain', 'United States', 'Argentina', 'India'],),
        (['China', 'Bahrain', 'United States', 'Bahrain', 'Greenland'],),
        (['India', 'Bahrain', 'United States', 'Bahrain', 'Greenland'],),
        (['United States', 'Bahrain', 'United States', 'Bahrain'],),
    ],
    ['country'],
)
df.show(20, False)

# Removes duplicate values from array.
df2 = df.select(array_distinct(df.country))
df2.show(20, False)
Note : We use Python 3.7.10 and Spark 3.0.1