PySpark : Calculate the percent rank of a set of values in a DataFrame column using PySpark [percent_rank]

pyspark.sql.functions.percent_rank

PySpark provides a percent_rank function as part of the pyspark.sql.functions module, which calculates the percent rank of each value in a DataFrame column. The percent rank is a value between 0 and 1 that indicates the relative rank of a value within a window of values; it is computed as (rank - 1) / (number of rows in the window - 1), so the lowest value receives 0.0 and the highest receives 1.0.

Here is an example to demonstrate the use of the percent_rank function in PySpark:

from pyspark.sql import Window
from pyspark.sql import SparkSession
from pyspark.sql.functions import percent_rank

# Start a SparkSession
spark = SparkSession.builder.appName("PercentRank Example @ Freshers.in").getOrCreate()
# Create a DataFrame
data = [(1,), (2,), (3,), (4,), (5,)]
df = spark.createDataFrame(data, ["value"])

# Use the percent_rank function to calculate the percent rank of the values in the DataFrame
df = df.select("value", percent_rank().over(Window.orderBy("value")).alias("percent_rank"))
df.show()

Output

+-----+------------+
|value|percent_rank|
+-----+------------+
|    1|         0.0|
|    2|        0.25|
|    3|         0.5|
|    4|        0.75|
|    5|         1.0|
+-----+------------+

As you can see, the percent_rank function has calculated the percent rank of each value in the DataFrame. The values are ordered ascending, and each percent rank follows from the formula above: with five rows, ranks 1 through 5 map to 0.0, 0.25, 0.5, 0.75 and 1.0.
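
One detail worth noting is how ties behave: rows with equal values receive the same rank in the ordering, and therefore the same percent rank. Here is a minimal sketch, reusing the SparkSession and imports from the example above (the data is made up for illustration):

# Duplicate values share the same rank, and therefore the same percent rank
data_with_ties = [(1,), (2,), (2,), (3,)]
df_ties = spark.createDataFrame(data_with_ties, ["value"])
df_ties.select("value", percent_rank().over(Window.orderBy("value")).alias("percent_rank")).show()

With four rows, the two tied values both get rank 2, so their percent rank is (2 - 1) / (4 - 1) = 0.333..., while 1 maps to 0.0 and 3 maps to 1.0.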

The percent_rank function is useful when working with large datasets, as it computes the relative standing of every row in a single pass over the ordered window rather than requiring a self-join or manual counting. One caveat: a window with an orderBy but no partitionBy moves all rows to a single partition, so for large datasets you will usually want to partition the window as well. The function can also be combined with other functions in the pyspark.sql.functions module to perform more complex operations on DataFrames.
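For instance, partitioning the window computes percent ranks independently within each group. A minimal sketch, again reusing the SparkSession from above, with a hypothetical dept column added for illustration:

# Each dept is ranked independently, spanning 0.0 to 1.0 within its own group
data_by_dept = [("sales", 10), ("sales", 20), ("sales", 30), ("hr", 5), ("hr", 15)]
df_dept = spark.createDataFrame(data_by_dept, ["dept", "value"])
w = Window.partitionBy("dept").orderBy("value")
df_dept.select("dept", "value", percent_rank().over(w).alias("percent_rank")).show()

Because the ranking restarts in every partition, the sales rows map to 0.0, 0.5 and 1.0 while the hr rows map to 0.0 and 1.0, regardless of how the two groups compare to each other.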

In conclusion, the percent_rank function in PySpark is a valuable tool for working with data in Spark DataFrames. Whether you need to determine the relative rank of a set of values or perform more complex windowed operations, the pyspark.sql.functions module provides the tools you need to get the job done.
