PySpark’s first function is part of the pyspark.sql.functions module. It is used in DataFrame aggregations to retrieve the first value in each group after data has been grouped with groupBy. It is particularly useful when you need to extract an initial or representative record from each group in a dataset, making it a simple yet powerful tool for data extraction in grouped data.
Why Use the First Function?
The first function is invaluable when you need to condense large datasets by extracting key values. It is commonly used in summarization, reporting, and analytics, where the initial or representative value of each group is critical for insights.
Practical Example with Real Data
Scenario
To demonstrate the use of the first function in PySpark, we will consider a simple dataset containing names and associated scores.
Creating a DataFrame: We will create a DataFrame with names and scores.
from pyspark.sql import SparkSession
from pyspark.sql.functions import first
spark = SparkSession.builder.appName("FirstExample").getOrCreate()
data = [("Sachin", 95), ("Manju", 88), ("Ram", 76),
("Raju", 89), ("David", 92), ("Freshers_in", 65), ("Wilson", 78)]
columns = ["Name", "Score"]
df = spark.createDataFrame(data, columns)
Applying the First Function: We will use the first function in an aggregation to extract the first score from our dataset. Calling groupBy() with no columns treats the entire DataFrame as a single group.
df_grouped = df.groupBy().agg(first("Score").alias("FirstScore"))
df_grouped.show()
Output
+----------+
|FirstScore|
+----------+
| 95|
+----------+
The output of the above code displays the first score in the dataset, 95. Keep in mind that Spark does not guarantee row order after shuffles, so first is only deterministic when you control the ordering of the input.