array_except
In PySpark, array_except returns an array of the elements that are present in one array column but not in the other, with duplicates removed.
Syntax:
array_except(array1, array2)
array1: An ARRAY of any type with comparable elements.
array2: An ARRAY of elements sharing a least common type with the elements of array1.
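As a minimal sketch of the call shape (the DataFrame, column names, and app name here are illustrative, not from the example below), the function takes two array columns and keeps the elements of the first that do not appear in the second:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_except

spark = SparkSession.builder.appName('array_except_demo').getOrCreate()

# array_except(arr1, arr2) keeps elements of arr1 that are not in arr2
demo_df = spark.createDataFrame([(["a", "b", "c"], ["b", "c", "d"])], ["arr1", "arr2"])
demo_df.select(array_except("arr1", "arr2").alias("diff")).show()
# diff -> ["a"]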
Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_except

spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()

# Each row: insurance provider, regions covered in 2022, regions covered in 2023
raw_data = [
    ("Berkshire", ["Alabama", "Alaska", "Arizona"], ["Alabama", "Alaska", "Arizona", "Arkansas"]),
    ("Allianz", ["California", "Connecticut", "Delaware"], ["California", "Colorado", "Connecticut", "Delaware"]),
    ("Zurich", ["Delaware", "Florida", "Georgia", "Hawaii", "Idaho"], ["Delaware", "Florida", "Georgia", "Hawaii", "Idaho"]),
    ("AIA", ["Iowa", "Kansas", "Kentucky"], ["Iowa", "Kansas", "Kentucky", "Louisiana"]),
    ("Munich", ["Hawaii", "Idaho", "Illinois", "Indiana"], ["Hawaii", "Illinois", "Indiana"])
]
df = spark.createDataFrame(data=raw_data, schema=["Insurace_Provider", "Country_2022", "Country_2023"])
df.show(20, False)

# Elements present in Country_2023 but not in Country_2022
df2 = df.select(array_except(df.Country_2023, df.Country_2022))
df2.show(20, False)

# Elements present in Country_2022 but not in Country_2023
df3 = df.select(array_except(df.Country_2022, df.Country_2023))
df3.show(20, False)

# Both differences as named columns alongside the provider name
df4 = df.withColumn("Insurance_Company", df.Insurace_Provider) \
    .withColumn("Newly_Introduced_Country", array_except(df.Country_2023, df.Country_2022)) \
    .withColumn("Operation_Closed_Country", array_except(df.Country_2022, df.Country_2023))
df4.show(20, False)
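The same difference columns can also be written in SQL expression form with selectExpr; the sketch below assumes the df DataFrame created above is still in scope and only reproduces the df4 step (df5 is an illustrative name).

# Equivalent to the withColumn calls above, using SQL expression syntax
df5 = df.selectExpr(
    "Insurace_Provider as Insurance_Company",
    "array_except(Country_2023, Country_2022) as Newly_Introduced_Country",
    "array_except(Country_2022, Country_2023) as Operation_Closed_Country"
)
df5.show(20, False)

With the sample data above, Berkshire gets Arkansas and Allianz gets Colorado under Newly_Introduced_Country, AIA gets Louisiana, and Munich gets Idaho under Operation_Closed_Country; Zurich's two arrays are identical, so both of its result arrays are empty.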