pyspark.sql.functions.collect_list(col)
collect_list() is an aggregate function that returns the values of a column as a list (an array column), keeping duplicates. Use it to gather the values of a DataFrame column, or of each group after a groupBy(), into a single array. The function is non-deterministic because the order of the collected values depends on the order of the rows, which may change after a shuffle.
Sample data set (freshers_sample.csv)
sino,name,date,age,status
1,Sam Peter,08-07-2022,7,1
2,John Manual,15-06-2022,12,2
3,Eric Burst,08-07-2022,6,2
4,Tim Moris,08-09-2022,8,2
5,Jack Berry,08-11-2022,10,3
Sample code with collect_list
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list
spark = SparkSession.builder.appName("collect_list example").getOrCreate()
df = spark.read.csv('D:\\Learning\\PySpark\\freshers_sample.csv',header=True,inferSchema=True)
df.show()
df.printSchema()
+----+-----------+----------+---+------+
|sino|       name|      date|age|status|
+----+-----------+----------+---+------+
|   1|  Sam Peter|08-07-2022|  7|     1|
|   2|John Manual|15-06-2022| 12|     2|
|   3| Eric Burst|08-07-2022|  6|     2|
|   4|  Tim Moris|08-09-2022|  8|     2|
|   5| Jack Berry|08-11-2022| 10|     3|
+----+-----------+----------+---+------+
root
 |-- sino: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- date: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- status: integer (nullable = true)
df.select(collect_list('age')).show()
df.agg(collect_list('age')).show()
>>> df.select(collect_list('age')).show()
+-----------------+
|collect_list(age)|
+-----------------+
|[7, 12, 6, 8, 10]|
+-----------------+
>>> df.agg(collect_list('age')).show()
+-----------------+
|collect_list(age)|
+-----------------+
|[7, 12, 6, 8, 10]|
+-----------------+