This article will focus on a particular use case: returning an array that contains the matching elements in two input arrays in PySpark. To illustrate this, we’ll use PySpark’s built-in functions and DataFrame transformations.
PySpark provides a direct function for this: array_intersect, available since Spark 2.4, compares two array columns and returns the elements they have in common. Other built-in functions such as explode and collect_list are only needed for less direct cases, for example when matching elements across rows.
Let’s assume we have a DataFrame that has two columns, both of which contain arrays:
from pyspark.sql import SparkSession
# Reuse the active session or start a new one
spark = SparkSession.builder.getOrCreate()
# Each row holds an id and two arrays of strings
data = [
    ("1", ["apple", "banana", "cherry"], ["banana", "cherry", "date"]),
    ("2", ["pear", "mango", "peach"], ["mango", "peach", "lemon"]),
]
df = spark.createDataFrame(data, ["id", "Array1", "Array2"])
df.show()
This creates a DataFrame with two rows, each holding an id and two array columns.
To return an array with the matching elements in ‘Array1’ and ‘Array2’, use the array_intersect function:
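from pyspark.sql.functions import array_intersect
# Add a column holding the elements common to both arrays in each row
df_with_matching_elements = df.withColumn("MatchingElements", array_intersect(df.Array1, df.Array2))
# truncate=False prints the full arrays instead of cutting them off
df_with_matching_elements.show(truncate=False)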
The ‘MatchingElements’ column contains the elements common to ‘Array1’ and ‘Array2’ in each row:
+---+-----------------------+----------------------+----------------+
|id |Array1                 |Array2                |MatchingElements|
+---+-----------------------+----------------------+----------------+
|1  |[apple, banana, cherry]|[banana, cherry, date]|[banana, cherry]|
|2  |[pear, mango, peach]   |[mango, peach, lemon] |[mango, peach]  |
+---+-----------------------+----------------------+----------------+
The array_intersect function is a simple way to find the matching elements in two array columns. Because it is a built-in Spark SQL function, it runs inside Spark’s execution engine without the overhead of a Python UDF, which makes it a good fit for large DataFrames. The returned array contains no duplicates. It’s important to remember, however, that this approach works on a row-by-row basis: it only compares the two arrays within the same row. If you want to find matches across all rows in the DataFrame, you’ll need to apply a different technique.