pyspark.sql.functions.greatest
Among PySpark's many functions, pyspark.sql.functions.greatest is an unsung hero for comparison operations. As its name suggests, it takes two or more columns and returns, for each row, the greatest value among them. Null values are skipped, and the result is null only when every input in the row is null.
Python's built-in max finds the largest item in a local list, whereas greatest is built for PySpark DataFrames. It compares columns row by row as a Spark expression, so the work is optimized and distributed across the cluster, which matters in big data scenarios. Note that greatest compares across columns within each row; to find the largest value down a single column, use the max aggregate function instead.
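To make the row-wise behaviour concrete, here is a minimal plain-Python sketch of what greatest computes for each row. The helper name row_greatest is hypothetical and is only a model of the semantics (skip missing values, return None when everything is missing); it is not how Spark implements the function. The marks are the same ones used in the exercise that follows.

```python
def row_greatest(*values):
    """Model of a per-row 'greatest': ignore None, return None if all are None."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


# Each tuple is one row: (name, Math, Physics, Chemistry)
data = [("Sachin", 85, 90, 88), ("Sangeeth", 92, 87, 93), ("Rakesh", 88, 89, 91)]

# Apply the helper to the marks of each row
highest = {name: row_greatest(*marks) for name, *marks in data}
print(highest)
```

In PySpark the same per-row comparison is expressed once as a column expression and evaluated by the executors, instead of looping over rows in Python.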
Before diving in, make sure PySpark and its required dependencies are installed. With that set, let's work through a hands-on exercise using hardcoded data:
from pyspark.sql import SparkSession
from pyspark.sql.functions import greatest
# Initialize Spark session
spark = SparkSession.builder.appName("greatest_demo @ Freshers.in").getOrCreate()
# Create a DataFrame with hardcoded data
data = [("Sachin", 85, 90, 88), ("Sangeeth", 92, 87, 93), ("Rakesh", 88, 89, 91)]
df = spark.createDataFrame(data, ["Name", "Math", "Physics", "Chemistry"])
# Determine the highest marks for each student
df_with_greatest = df.withColumn("Highest_Mark", greatest("Math", "Physics", "Chemistry"))
# Display the results
df_with_greatest.show()
When executed, this script prints a DataFrame showing each student's name, their marks in the three subjects, and the highest of those three marks:
+--------+----+-------+---------+------------+
|    Name|Math|Physics|Chemistry|Highest_Mark|
+--------+----+-------+---------+------------+
|  Sachin|  85|     90|       88|          90|
|Sangeeth|  92|     87|       93|          93|
|  Rakesh|  88|     89|       91|          91|
+--------+----+-------+---------+------------+