How to create an array containing a column repeated count times – PySpark


To repeat a column's value count times as an array in PySpark, use the function below.

Library : pyspark.sql.functions.array_repeat

array_repeat is a collection function that creates an array containing the value of a column repeated count times.

Example

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_repeat

# Create a SparkSession if one is not already active
spark = SparkSession.builder.appName("array_repeat_example").getOrCreate()

df = spark.createDataFrame(
    [('a1', 'a2', 'a3'), ('b1', 'b2', 'b3'),
     ('c1', 'c2', 'c3'), ('d1', 'd2', 'd3'),
     ('e1', 'e2', 'e3')],
    ['data1', 'data2', 'data3'])
df.show()
+-----+-----+-----+
|data1|data2|data3|
+-----+-----+-----+
|   a1|   a2|   a3|
|   b1|   b2|   b3|
|   c1|   c2|   c3|
|   d1|   d2|   d3|
|   e1|   e2|   e3|
+-----+-----+-----+
df2 = df.select(df.data1,df.data2,df.data3,\
    array_repeat(df.data3, 3).alias('repeat_data3'),\
    array_repeat(df.data1, 2).alias('repeat_data1'))
df2.show()
+-----+-----+-----+------------+------------+
|data1|data2|data3|repeat_data3|repeat_data1|
+-----+-----+-----+------------+------------+
|   a1|   a2|   a3|[a3, a3, a3]|    [a1, a1]|
|   b1|   b2|   b3|[b3, b3, b3]|    [b1, b1]|
|   c1|   c2|   c3|[c3, c3, c3]|    [c1, c1]|
|   d1|   d2|   d3|[d3, d3, d3]|    [d1, d1]|
|   e1|   e2|   e3|[e3, e3, e3]|    [e1, e1]|
+-----+-----+-----+------------+------------+
