PySpark: Find the minimum value in an array column of a DataFrame

pyspark.sql.functions.array_min

The array_min function is a built-in PySpark function that finds the minimum value in an array column of a DataFrame. It is a collection function: it is evaluated row by row and returns the minimum value of each row's array.

Here is an example of how to use array_min:

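from pyspark.sql import SparkSession
from pyspark.sql.functions import array_min
# Start or reuse a Spark session (shells and notebooks usually provide `spark` already)
spark = SparkSession.builder.appName("array_min_example").getOrCreate()
# Create a DataFrame with a single array column named "numbers"
data = [([1, 2, 3],), ([4, 5, 6],), ([7, 8, 9],)]
df = spark.createDataFrame(data, ["numbers"])
# Find the minimum value of each row's array
df.select(array_min("numbers").alias("min_number")).show()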

Result

+----------+
|min_number|
+----------+
|         1|
|         4|
|         7|
+----------+

The advantages of using array_min are:

  1. It is built into PySpark (available since Spark 2.4), so it requires no dependencies beyond an import from pyspark.sql.functions.
  2. It is easy to use, as it takes only one argument: the array column, given either as a column name or as a Column object.
  3. It returns the minimum value of each array directly, without a UDF or an explode-and-aggregate step, making it a simple and efficient way to find minimum values in a large dataset.
  4. It can also be combined with other PySpark functions for more complex data processing tasks.

Note that array_min only works on array columns; it cannot be used to find the minimum value of a column of any other type.

Author: user