How to create an array containing a column repeated count times – PySpark

To build an array that repeats a column's value k times in PySpark, we can use the function below.

Function : pyspark.sql.functions.array_repeat

array_repeat(col, count) is a collection function that creates, for each row, an array containing the value of col repeated count times.
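As a rough mental model only (this is not how Spark implements it), the per-row behaviour can be pictured with plain Python list repetition. The helper name array_repeat_model is hypothetical and exists only for this illustration; null handling and other edge cases should be checked against the Spark documentation.

```python
def array_repeat_model(value, count):
    """Hypothetical helper: mimic array_repeat for a single row's value."""
    # Repeating a one-element list gives a list with `count` copies of `value`
    return [value] * count


# One row's value from column data3, repeated 3 times
print(array_repeat_model('a3', 3))
```

Spark applies the same idea to every row of the column, producing one array per row.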


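from pyspark.sql import SparkSession
from pyspark.sql.functions import array_repeat
# Create (or reuse) a Spark session
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('a1','a2','a3'),('b1','b2','b3'),\
    ('c1','c2','c3'),('d1','d2','d3'),\
    ('e1','e2','e3')], ['data1','data2','data3'])
df.show()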
+-----+-----+-----+
|data1|data2|data3|
+-----+-----+-----+
|   a1|   a2|   a3|
|   b1|   b2|   b3|
|   c1|   c2|   c3|
|   d1|   d2|   d3|
|   e1|   e2|   e3|
+-----+-----+-----+
# Keep the original columns and add two repeated-array columns
df2 = df.select(df.data1,df.data2,df.data3,\
    array_repeat(df.data3, 3).alias('repeat_data3'),\
    array_repeat(df.data1, 2).alias('repeat_data1'))
df2.show()
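+-----+-----+-----+------------+------------+
|data1|data2|data3|repeat_data3|repeat_data1|
+-----+-----+-----+------------+------------+
|   a1|   a2|   a3|[a3, a3, a3]|    [a1, a1]|
|   b1|   b2|   b3|[b3, b3, b3]|    [b1, b1]|
|   c1|   c2|   c3|[c3, c3, c3]|    [c1, c1]|
|   d1|   d2|   d3|[d3, d3, d3]|    [d1, d1]|
|   e1|   e2|   e3|[e3, e3, e3]|    [e1, e1]|
+-----+-----+-----+------------+------------+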


Author: user
