PySpark : Returning the input values, pivoted into an ARRAY


To pivot data in PySpark into an array, you can combine the groupBy, pivot, and collect_list functions. groupBy groups the DataFrame by the specified columns, pivot turns the distinct values of a column into separate output columns with a specified aggregation, and collect_list aggregates the values in each group into a list, keeping duplicates.

Below is an example that creates a DataFrame and then pivots the ‘value’ column into arrays, grouping by ‘id’ and pivoting on ‘type’.

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list
# Create a Spark session
spark = SparkSession.builder.appName('pivot_to_array').getOrCreate()
# Create a sample DataFrame
data = [
    ("1", "type1", "value1"),
    ("1", "type2", "value2"),
    ("2", "type1", "value3"),
    ("2", "type2", "value4"),
]
df = spark.createDataFrame(data, ["id", "type", "value"])
# Show the input DataFrame
df.show()

Result

+---+-----+------+
| id| type| value|
+---+-----+------+
|  1|type1|value1|
|  1|type2|value2|
|  2|type1|value3|
|  2|type2|value4|
+---+-----+------+

# Pivot 'type' and collect the 'value' entries into an array per id
df_pivot = df.groupBy("id").pivot("type").agg(collect_list("value"))
# Show the pivoted DataFrame
df_pivot.show()

Final Output
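Given the input above, df_pivot.show() should produce something like the following (row order is not guaranteed):

+---+--------+--------+
| id|   type1|   type2|
+---+--------+--------+
|  1|[value1]|[value2]|
|  2|[value3]|[value4]|
+---+--------+--------+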

In this example, groupBy(“id”) groups the DataFrame by ‘id’, pivot(“type”) pivots the ‘type’ column, and agg(collect_list(“value”)) collects the ‘value’ column into an array for each group. The resulting DataFrame will have one row for each unique ‘id’, and a column for each unique ‘type’, with the values in these columns being arrays of the corresponding ‘value’ entries.
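If you know the pivot values ahead of time, you can pass them to pivot explicitly; Spark then skips the extra job it would otherwise run to compute the distinct values of the pivot column. A minimal sketch, reusing the df defined above:

# Passing the pivot values up front avoids an extra pass over the data
# to discover the distinct 'type' values
df_pivot_fast = df.groupBy("id").pivot("type", ["type1", "type2"]).agg(collect_list("value"))
df_pivot_fast.show()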

‘collect_list’ collects all values including duplicates. If you want to collect only unique values, use ‘collect_set’ instead.
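For instance, a minimal sketch of the same pivot with collect_set, which keeps only the distinct values in each group (element order within the arrays is not guaranteed):

from pyspark.sql.functions import collect_set
# collect_set drops duplicate values within each (id, type) group
df_pivot_unique = df.groupBy("id").pivot("type").agg(collect_set("value"))
df_pivot_unique.show()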
