Distinction Between dense_rank() and row_number() in PySpark


PySpark, the Python API for Apache Spark, offers a powerful set of functions for data manipulation and analysis. Two commonly used window functions for ranking data in PySpark are dense_rank() and row_number(). They look similar at first glance, but they treat tied values differently, and picking the wrong one can silently change your results. In this article, we will explore the distinction between dense_rank() and row_number(), with a practical example and a comparison of their output.

Understanding dense_rank() and row_number()

dense_rank()

The dense_rank() function assigns a rank to each row within a window, and rows that tie on the ordering column receive the same rank. The ranks are consecutive, with no gaps: if two rows tie at rank 1, the next distinct value receives rank 2 rather than skipping to 3.
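
To make the numbering rule concrete, here is a minimal plain-Python sketch of the same idea, without Spark. The helper name dense_rank_sketch is hypothetical and only mimics the behaviour described above for a descending order; it is not how Spark implements the function.

```python
def dense_rank_sketch(scores):
    """Return (score, rank) pairs, ranked by score descending with no gaps."""
    ordered = sorted(scores, reverse=True)
    ranked = []
    rank = 0
    previous = None
    for score in ordered:
        # The rank goes up only when the value changes, so ties share a rank
        if rank == 0 or score != previous:
            rank += 1
            previous = score
        ranked.append((score, rank))
    return ranked


print(dense_rank_sketch([100, 95, 100, 95, 90]))
# [(100, 1), (100, 1), (95, 2), (95, 2), (90, 3)]
```

Five rows produce only three distinct ranks, because each pair of tied scores shares one.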

row_number()

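In contrast, the row_number() function assigns a unique, sequential integer (1, 2, 3, ...) to each row within the window, regardless of duplicate values. Tied rows still receive different numbers, and the order among them is not guaranteed unless the window ordering includes a tie-breaking column.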
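
The same idea can be sketched in plain Python, without Spark: every row simply gets the next integer, whether or not its value repeats. The helper name row_number_sketch is hypothetical. One assumption to keep in mind is that Python's sort is stable, so tied rows keep their input order here, whereas Spark makes no such promise.

```python
def row_number_sketch(rows):
    """Number (name, score) rows 1..n by score descending."""
    # sorted() is stable: rows with equal scores keep their input order
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [(name, score, position)
            for position, (name, score) in enumerate(ordered, start=1)]


rows = [("Sachin", 100), ("Mannu", 95), ("Raju", 100), ("David", 95), ("Justin", 90)]
for row in row_number_sketch(rows):
    print(row)
```

Sachin and Raju have the same score but end up with positions 1 and 2, which is exactly the behaviour that separates row_number() from dense_rank().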

Practical Examples

Let’s illustrate the difference between dense_rank() and row_number() with a practical example.

Example 1: Basic Usage

Suppose we have a PySpark DataFrame df with the following data:

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("rank_example").getOrCreate()

# Two pairs of tied scores (100 and 95) make the ranking difference visible
data = [(1, "Sachin", 100),
        (2, "Mannu", 95),
        (3, "Raju", 100),
        (4, "David", 95),
        (5, "Justin", 90)]
columns = ["id", "name", "score"]
df = spark.createDataFrame(data, columns)

Using dense_rank()

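from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

# Order the whole DataFrame by score, highest first. With no partitionBy(),
# Spark moves all rows to a single partition, which is fine for a small example.
window_spec = Window.orderBy(df["score"].desc())
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()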

Output:

+---+------+-----+----------+
| id|  name|score|dense_rank|
+---+------+-----+----------+
|  1|Sachin|  100|         1|
|  3|  Raju|  100|         1|
|  2| Mannu|   95|         2|
|  4| David|   95|         2|
|  5|Justin|   90|         3|
+---+------+-----+----------+

Using row_number()

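from pyspark.sql.functions import row_number

# Reuses window_spec from the dense_rank() example above.
# Rows with the same score may be numbered in a different order on another run.
df.withColumn("row_number", row_number().over(window_spec)).show()

Output:

+---+------+-----+----------+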
| id|  name|score|row_number|
+---+------+-----+----------+
|  1|Sachin|  100|         1|
|  3|  Raju|  100|         2|
|  2| Mannu|   95|         3|
|  4| David|   95|         4|
|  5|Justin|   90|         5|
+---+------+-----+----------+

Author: user