PySpark : Find the maximum value in an array column of a DataFrame

PySpark @ Freshers.in

pyspark.sql.functions.array_max

The array_max function is a built-in function in PySpark that finds the maximum value in an array column of a DataFrame. It is a collection function: for each row, it returns the largest element of that row's array.

Here is an example of how to use array_max:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_max

spark = SparkSession.builder.appName("array_max_example").getOrCreate()

# Create a DataFrame with an array column
data = [([1, 2, 3],), ([4, 5, 6],), ([7, 8, 9],)]
df = spark.createDataFrame(data, ["numbers"])

# Find the maximum value in each row's array
df.select(array_max("numbers").alias("max_number")).show()

Result

+----------+
|max_number|
+----------+
|         3|
|         6|
|         9|
+----------+

The advantages of using array_max are:

  1. It is built into PySpark (pyspark.sql.functions), so no third-party dependencies are needed.
  2. It is easy to use: it takes a single argument, the array column (or its name).
  3. It computes the per-row maximum of an array column in one call, a simple and efficient way to find maxima even in large datasets.
  4. It can be combined with other PySpark functions for more complex data processing tasks.

Note that array_max works only on array columns; it cannot be used to find the maximum value of a scalar column.
