PySpark: Combine two or more arrays into a single array of tuples

PySpark @ Freshers.in

pyspark.sql.functions.arrays_zip

In PySpark, the arrays_zip function combines two or more arrays into a single array of structs (tuples). Each struct in the resulting array contains the elements at the corresponding position in the input arrays: the N-th struct holds the N-th value of every input array.
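As a rough mental model, the per-row behaviour can be sketched in plain Python without a Spark session. The helper name arrays_zip_model is hypothetical and is not part of PySpark, and dicts stand in for Spark structs. The sketch also assumes that shorter arrays are padded with nulls (None here), which is how Spark is understood to treat arrays of unequal length.

```python
from itertools import zip_longest

def arrays_zip_model(**arrays):
    """Plain-Python model of arrays_zip for one row (hypothetical helper, not a PySpark API)."""
    names = list(arrays)
    # zip_longest pads shorter arrays with None, standing in for Spark nulls
    return [dict(zip(names, values)) for values in zip_longest(*arrays.values())]

zipped = arrays_zip_model(
    si_no=[1, 2, 3],
    name=['Sam John', 'Perter Walter', 'Johns Mike'],
)
print(zipped[0])  # the first "struct" pairs the first element of each array
```

This is only an illustration of the positional pairing; the real function runs inside Spark and returns an array of structs column.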

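from pyspark.sql import SparkSession
from pyspark.sql.functions import arrays_zip

# Reuse the active session (the pyspark shell already provides `spark`) or create one
spark = SparkSession.builder.getOrCreate()

# One row with two array columns of equal length
df = spark.createDataFrame([([1, 2, 3], ['Sam John', 'Perter Walter', 'Johns Mike'])], ['si_no', 'name'])
df.show(20, False)
+---------+-------------------------------------+
|si_no    |name                                 |
+---------+-------------------------------------+
|[1, 2, 3]|[Sam John, Perter Walter, Johns Mike]|
+---------+-------------------------------------+

# Zip the two arrays position by position into one array of structs
zipped_array = df.select(arrays_zip(df.si_no, df.name))
zipped_array.show(20, False)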

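Result

+----------------------------------------------------+
|arrays_zip(si_no, name)                             |
+----------------------------------------------------+
|[[1, Sam John], [2, Perter Walter], [3, Johns Mike]]|
+----------------------------------------------------+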

You can also use arrays_zip with more than two arrays as input. For example:

from pyspark.sql.functions import arrays_zip

# Same data with a third array column, age
df = spark.createDataFrame([([1, 2, 3], ['Sam John', 'Perter Walter', 'Johns Mike'], [23, 43, 41])], ['si_no', 'name', 'age'])
# Each struct now carries one value from each of the three arrays
zipped_array = df.select(arrays_zip(df.si_no, df.name, df.age))
zipped_array.show(20, False)

Result

+----------------------------------------------------------------+
|arrays_zip(si_no, name, age)                                    |
+----------------------------------------------------------------+
|[[1, Sam John, 23], [2, Perter Walter, 43], [3, Johns Mike, 41]]|
+----------------------------------------------------------------+

Author: user
