pyspark.sql.functions.arrays_zip
In PySpark, the arrays_zip function combines two or more arrays into a single array of structs. It returns a merged array in which the N-th struct contains the N-th value of each input array.
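As a rough mental model, this positional pairing behaves like Python's built-in zip applied to plain lists. The sketch below is plain Python, not Spark code; the helper name zip_arrays_model is hypothetical and only illustrates the semantics for arrays of equal length, with tuples standing in for Spark structs.

```python
# Plain-Python model of arrays_zip for equal-length arrays (not Spark code).
# zip_arrays_model is a hypothetical helper used only for illustration.
def zip_arrays_model(*arrays):
    # The N-th tuple collects the N-th element of every input list.
    return list(zip(*arrays))

si_no = [1, 2, 3]
name = ['Sam John', 'Perter Walter', 'Johns Mike']

zipped = zip_arrays_model(si_no, name)
print(zipped)
```

Each tuple in zipped plays the role of one struct in the array that arrays_zip returns.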
from pyspark.sql.functions import arrays_zip
# Assumes an active SparkSession named spark, as in the pyspark shell.
df = spark.createDataFrame([([1, 2, 3], ['Sam John', 'Perter Walter', 'Johns Mike'])], ['si_no', 'name'])
df.show(20, False)
+---------+-------------------------------------+
|si_no    |name                                 |
+---------+-------------------------------------+
|[1, 2, 3]|[Sam John, Perter Walter, Johns Mike]|
+---------+-------------------------------------+
# Pair the N-th element of si_no with the N-th element of name
zipped_array = df.select(arrays_zip(df.si_no, df.name))
zipped_array.show(20, False)
Result
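+----------------------------------------------------+
|arrays_zip(si_no, name)                             |
+----------------------------------------------------+
|[[1, Sam John], [2, Perter Walter], [3, Johns Mike]]|
+----------------------------------------------------+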
You can also use arrays_zip with more than two arrays as input. For example:
from pyspark.sql.functions import arrays_zip
df = spark.createDataFrame([([1, 2, 3], ['Sam John', 'Perter Walter', 'Johns Mike'], [23, 43, 41])], ['si_no', 'name', 'age'])
zipped_array = df.select(arrays_zip(df.si_no, df.name, df.age))
zipped_array.show(20, False)
Result
+----------------------------------------------------------------+
|arrays_zip(si_no, name, age)                                    |
+----------------------------------------------------------------+
|[[1, Sam John, 23], [2, Perter Walter, 43], [3, Johns Mike, 41]]|
+----------------------------------------------------------------+
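The examples here use arrays of equal length. When the input arrays differ in length, Spark pads the shorter ones with null, so the result is as long as the longest input. A plain-Python sketch of that behavior follows, using itertools.zip_longest with None standing in for null; pad_zip_model is a hypothetical name, and this models the semantics only, not Spark's implementation.

```python
from itertools import zip_longest

# Plain-Python model of arrays_zip on arrays of unequal length (not Spark code).
# pad_zip_model is a hypothetical helper used only for illustration.
def pad_zip_model(*arrays):
    # Missing positions in shorter lists are filled with None (null in Spark).
    return list(zip_longest(*arrays, fillvalue=None))

si_no = [1, 2, 3]
age = [23, 43]  # one element shorter than si_no

padded = pad_zip_model(si_no, age)
print(padded)
```

The last tuple carries None in the age position because age has no third element.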