How to get the common elements from two arrays in two columns in PySpark (array_intersect)


array_intersect

When you want to get the common elements from two array columns in PySpark, you can use the array_intersect function.

Collection function: array_intersect returns an array of the elements in the intersection of column_1_array and column_2_array, without duplicates. The result is an ARRAY of the same type as column_1_array, with no duplicates, containing only the elements present in both column_1_array and column_2_array.

Function

pyspark.sql.functions.array_intersect(column_1_array, column_2_array)

Syntax

array_intersect(column_1_array, column_2_array)

column_1_array : An ARRAY of any type with comparable elements.
column_2_array : An ARRAY of elements sharing a least common type with the elements of column_1_array.
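
Before the full walkthrough, here is a minimal sketch of the signature in action on two literal arrays. The column names arr_a and arr_b and the app name are illustrative, not part of the original example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, array_intersect, lit

spark = SparkSession.builder.appName('array_intersect minimal sketch').getOrCreate()

# Build a single-row DataFrame with two literal array columns
df_demo = spark.createDataFrame([(1,)], ["id"]) \
    .withColumn("arr_a", array(lit("x"), lit("y"), lit("z"))) \
    .withColumn("arr_b", array(lit("y"), lit("z"), lit("w")))

# Keep only the elements present in both arrays
df_demo.select(array_intersect("arr_a", "arr_b").alias("common")).show()

With these inputs, the common column contains only the shared elements y and z.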

Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_intersect

spark = SparkSession.builder.appName('www.freshers.in training array_intersect').getOrCreate()

# Each row holds an insurance provider and its coverage lists for 2022 and 2023
raw_data = [
    ("Berkshire", ["Alabama", "Alaska", "Arizona", "California"], ["Alabama", "Alaska", "Arizona", "Arkansas"]),
    ("Allianz", ["California", "Connecticut", "Delaware", "Alabama"], ["California", "Colorado", "Connecticut", "Delaware"]),
    ("Zurich", ["Delaware", "Florida", "Georgia", "Hawaii", "Idaho"], ["Delaware", "Florida", "Georgia", "Hawaii", "Louisiana"]),
    ("AIA", ["Iowa", "Kansas", "Kentucky"], ["Iowa", "Kansas", "Kentucky", "Louisiana"]),
    ("Munich", ["Hawaii", "Idaho", "Illinois", "Indiana"], ["Hawaii", "Illinois", "Indiana"])]
df = spark.createDataFrame(data=raw_data, schema=["Insurance_Provider", "Country_2022", "Country_2023"])
df.show(20, False)

# Keep only the entries that appear in both the 2022 and 2023 arrays
df2 = df.select(df.Insurance_Provider, array_intersect(df.Country_2023, df.Country_2022))
df2.show(20, False)

# Register the DataFrame as a temporary view so it can be queried with Spark SQL
df.createOrReplaceTempView("insurance_tbl")
spark.sql("select * from insurance_tbl").show()
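
Because the DataFrame is registered as a temporary view, the same intersection can also be expressed in Spark SQL. A short sketch reusing the insurance_tbl view from above (the common_states alias is illustrative):

# Same intersection expressed in Spark SQL against the temporary view
spark.sql("""
    SELECT Insurance_Provider,
           array_intersect(Country_2023, Country_2022) AS common_states
    FROM insurance_tbl
""").show(20, False)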

