PySpark : Returning the input values, pivoted into an ARRAY


To pivot data in PySpark into an array, you can combine the groupBy, pivot, and collect_list functions. groupBy groups the DataFrame by the specified columns, pivot turns the distinct values of a column into separate output columns with a specified aggregation, and collect_list aggregates the values in each group into a list, keeping duplicates.

Below is an example that creates a DataFrame and then pivots the ‘value’ column into arrays, grouping by ‘id’ and pivoting on ‘type’.

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list
# Create a Spark session
spark = SparkSession.builder.appName('pivot_to_array').getOrCreate()
# Create a sample DataFrame
data = [
    ("1", "type1", "value1"),
    ("1", "type2", "value2"),
    ("2", "type1", "value3"),
    ("2", "type2", "value4"),
]
df = spark.createDataFrame(data, ["id", "type", "value"])
# Show the input DataFrame
df.show()

Result

+---+-----+------+
| id| type| value|
+---+-----+------+
|  1|type1|value1|
|  1|type2|value2|
|  2|type1|value3|
|  2|type2|value4|
+---+-----+------+

# Pivot 'type' and collect the 'value' entries into an array per id
df_pivot = df.groupBy("id").pivot("type").agg(collect_list("value"))
# Show the pivoted DataFrame
df_pivot.show()

Final Output
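Given the input above, df_pivot.show() should produce something like the following (row order is not guaranteed):

+---+--------+--------+
| id|   type1|   type2|
+---+--------+--------+
|  1|[value1]|[value2]|
|  2|[value3]|[value4]|
+---+--------+--------+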

In this example, groupBy(“id”) groups the DataFrame by ‘id’, pivot(“type”) pivots the ‘type’ column, and agg(collect_list(“value”)) collects the ‘value’ column into an array for each group. The resulting DataFrame will have one row for each unique ‘id’, and a column for each unique ‘type’, with the values in these columns being arrays of the corresponding ‘value’ entries.
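If you know the pivot values ahead of time, you can pass them to pivot explicitly; Spark then skips the extra job it would otherwise run to compute the distinct values of the pivot column. A minimal sketch, reusing the df defined above:

# Passing the pivot values up front avoids an extra pass over the data
# to discover the distinct 'type' values
df_pivot_fast = df.groupBy("id").pivot("type", ["type1", "type2"]).agg(collect_list("value"))
df_pivot_fast.show()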

‘collect_list’ collects all values including duplicates. If you want to collect only unique values, use ‘collect_set’ instead.
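For instance, a minimal sketch of the same pivot with collect_set, which keeps only the distinct values in each group (element order within the arrays is not guaranteed):

from pyspark.sql.functions import collect_set
# collect_set drops duplicate values within each (id, type) group
df_pivot_unique = df.groupBy("id").pivot("type").agg(collect_set("value"))
df_pivot_unique.show()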
