Convert PySpark DataFrame column values to Row format, or collect a column's elements into a single row

PySpark @ Freshers.in

pyspark.sql.functions.collect_list(col)

collect_list() is an aggregate function that returns a list of the values in a column, keeping duplicates. Use it to gather the values of a PySpark DataFrame column and return them as a single list inside one Row. The function is non-deterministic: the order of the collected values depends on the order of the rows, which may change after a shuffle.

Sample data set (freshers_sample.csv)

sino,name,date,age,status
1,Sam Peter,08-07-2022,7,1
2,John Manual,15-06-2022,12,2
3,Eric Burst,08-07-2022,6,2
4,Tim Moris,08-09-2022,8,2
5,Jack Berry,08-11-2022,10,3

Sample code with collect_list

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("collect_list example").getOrCreate()
# Read the sample CSV, taking column names from the header line and inferring column types
df = spark.read.csv('D:\\Learning\\PySpark\\freshers_sample.csv', header=True, inferSchema=True)
df.show()
df.printSchema()
+----+-----------+----------+---+------+
|sino|       name|      date|age|status|
+----+-----------+----------+---+------+
|   1|  Sam Peter|08-07-2022|  7|     1|
|   2|John Manual|15-06-2022| 12|     2|
|   3| Eric Burst|08-07-2022|  6|     2|
|   4|  Tim Moris|08-09-2022|  8|     2|
|   5| Jack Berry|08-11-2022| 10|     3|
+----+-----------+----------+---+------+

root
 |-- sino: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- date: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- status: integer (nullable = true)

# Without a groupBy, select() and agg() both aggregate over the whole DataFrame
df.select(collect_list('age')).show()
df.agg(collect_list('age')).show()

>>> df.select(collect_list('age')).show()
+-----------------+
|collect_list(age)|
+-----------------+
|[7, 12, 6, 8, 10]|
+-----------------+

>>> df.agg(collect_list('age')).show()
+-----------------+
|collect_list(age)|
+-----------------+
|[7, 12, 6, 8, 10]|
+-----------------+
Author: user
