Convert data from the PySpark DataFrame columns to Row format or get elements in columns in row

user November 30, 2022

pyspark.sql.functions.collect_list(col)

collect_list() is an aggregate function that gathers the values of a column into a single list, keeping duplicates. Use it when you want the elements of a PySpark DataFrame column returned together as one array inside a single Row. The function is non-deterministic: the order of the collected values depends on the order of the rows, which may change after a shuffle.

Sample data set (freshers_sample.csv)

sino,name,date,age,status
1,Sam Peter,08-07-2022,7,1
2,John Manual,15-06-2022,12,2
3,Eric Burst,08-07-2022,6,2
4,Tim Moris,08-09-2022,8,2
5,Jack Berry,08-11-2022,10,3

Sample code with collect_list

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("collect_list example").getOrCreate()

# Read the CSV with a header row and let Spark infer the column types
df = spark.read.csv('D:\\Learning\\PySpark\\freshers_sample.csv', header=True, inferSchema=True)
df.show()
df.printSchema()

+----+-----------+----------+---+------+
|sino|       name|      date|age|status|
+----+-----------+----------+---+------+
|   1|  Sam Peter|08-07-2022|  7|     1|
|   2|John Manual|15-06-2022| 12|     2|
|   3| Eric Burst|08-07-2022|  6|     2|
|   4|  Tim Moris|08-09-2022|  8|     2|
|   5| Jack Berry|08-11-2022| 10|     3|
+----+-----------+----------+---+------+

root
 |-- sino: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- date: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- status: integer (nullable = true)

# Both calls aggregate over the whole DataFrame and return the same single row
df.select(collect_list('age')).show()
df.agg(collect_list('age')).show()

>>> df.select(collect_list('age')).show()
+-----------------+
|collect_list(age)|
+-----------------+
|[7, 12, 6, 8, 10]|
+-----------------+

>>> df.agg(collect_list('age')).show()
+-----------------+
|collect_list(age)|
+-----------------+
|[7, 12, 6, 8, 10]|
+-----------------+
