How to find the difference between two arrays in PySpark (array_except)


array_except

In PySpark, array_except returns an array of the elements that are present in the first array but not in the second, without duplicates.

Syntax:

array_except(array1, array2)

array1: An ARRAY of any type with comparable elements.
array2: An ARRAY of elements sharing a least common type with the elements of array1.
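As a mental model, array_except(array1, array2) behaves like an order-preserving set difference: keep each distinct element of array1 that does not appear in array2. Here is a minimal pure-Python sketch of that behavior (plain Python so it runs without a Spark cluster; the helper name is ours, not part of PySpark):

```python
def array_except_py(array1, array2):
    """Order-preserving set difference: distinct elements of array1 not in array2.

    A plain-Python model of the semantics of PySpark's array_except,
    not the actual Spark implementation.
    """
    exclude = set(array2)
    result = []
    for element in array1:
        if element not in exclude:   # keep only elements absent from array2
            result.append(element)
            exclude.add(element)     # also drop duplicates from the output
    return result

print(array_except_py(["b", "a", "c", "a"], ["c", "d"]))  # ['b', 'a']
```

Note that duplicates in array1 appear at most once in the result, matching the "without duplicates" behavior described above.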

Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_except

spark = SparkSession.builder.appName('www.freshers.in training').getOrCreate()

# Each row: insurance provider, states operated in during 2022, states operated in during 2023
raw_data = [
    ("Berkshire", ["Alabama", "Alaska", "Arizona"], ["Alabama", "Alaska", "Arizona", "Arkansas"]),
    ("Allianz", ["California", "Connecticut", "Delaware"], ["California", "Colorado", "Connecticut", "Delaware"]),
    ("Zurich", ["Delaware", "Florida", "Georgia", "Hawaii", "Idaho"], ["Delaware", "Florida", "Georgia", "Hawaii", "Idaho"]),
    ("AIA", ["Iowa", "Kansas", "Kentucky"], ["Iowa", "Kansas", "Kentucky", "Louisiana"]),
    ("Munich", ["Hawaii", "Idaho", "Illinois", "Indiana"], ["Hawaii", "Illinois", "Indiana"]),
]
df = spark.createDataFrame(data=raw_data, schema=["Insurance_Provider", "Country_2022", "Country_2023"])
df.show(20, False)

# Elements present in Country_2023 but not in Country_2022
df2 = df.select(array_except(df.Country_2023, df.Country_2022))
df2.show(20, False)

# Elements present in Country_2022 but not in Country_2023
df3 = df.select(array_except(df.Country_2022, df.Country_2023))
df3.show(20, False)

# Both differences as named columns alongside the provider
df4 = df.withColumn("Insurance_Company", df.Insurance_Provider) \
    .withColumn("Newly_Introduced_Country", array_except(df.Country_2023, df.Country_2022)) \
    .withColumn("Operation_Closed_Country", array_except(df.Country_2022, df.Country_2023))
df4.show(20, False)
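The per-row logic behind df4 can be sketched in plain Python as well (no Spark needed): for each provider, the newly introduced entries are the 2023 list minus the 2022 list, and the closed entries are the reverse. The helper below is a hypothetical illustration, not PySpark API; the sample rows contain no duplicate entries, so a simple list comprehension captures the difference here:

```python
def diff_by_provider(rows):
    """For (provider, states_2022, states_2023) tuples, return
    (provider, newly_introduced, closed) using set-difference logic
    analogous to array_except. Illustrative helper, not PySpark API."""
    out = []
    for provider, y2022, y2023 in rows:
        newly = [s for s in y2023 if s not in y2022]   # in 2023, not in 2022
        closed = [s for s in y2022 if s not in y2023]  # in 2022, not in 2023
        out.append((provider, newly, closed))
    return out

sample = [
    ("Berkshire", ["Alabama", "Alaska", "Arizona"],
     ["Alabama", "Alaska", "Arizona", "Arkansas"]),
    ("Munich", ["Hawaii", "Idaho", "Illinois", "Indiana"],
     ["Hawaii", "Illinois", "Indiana"]),
]
for provider, newly, closed in diff_by_provider(sample):
    print(provider, "new:", newly, "closed:", closed)
# Berkshire new: ['Arkansas'] closed: []
# Munich new: [] closed: ['Idaho']
```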


Result

Running the code above, array_except(Country_2023, Country_2022) yields the newly added entries per row: [Arkansas] for Berkshire, [Colorado] for Allianz, [Louisiana] for AIA, and empty arrays for Zurich and Munich. The reverse call, array_except(Country_2022, Country_2023), is empty for every row except Munich, where it returns [Idaho].
