Distinction Between dense_rank() and row_number() in PySpark

user January 31, 2024

PySpark, the Python API for Apache Spark, offers a rich set of functions for data manipulation and analysis. Two commonly used ranking functions are dense_rank() and row_number(). Both are window functions, so each is evaluated over a window specification that defines the row ordering. They can look similar at first glance, but they treat tied values differently, and that difference matters when results must be precise. This article explains the distinction with a practical example and a comparison of the output.

Understanding `dense_rank()` and `row_number()`

`dense_rank()`

The dense_rank() function assigns a rank to each row within a window, and rows with equal values in the ordering columns receive the same rank. Ranks are consecutive, with no gaps after ties: if two rows tie for rank 1, the next distinct value receives rank 2 (whereas rank() would skip to 3).
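To make the "no gaps" rule concrete, here is a minimal plain-Python sketch of the semantics. It is not how Spark implements dense_rank(); the helper name dense_rank_desc is hypothetical and exists only for illustration.

```python
def dense_rank_desc(values):
    """Return the dense rank of each value, ranking the highest value first."""
    # Each distinct value gets the next consecutive rank: 1, 2, 3, ...
    distinct_sorted = sorted(set(values), reverse=True)
    rank_of = {value: rank for rank, value in enumerate(distinct_sorted, start=1)}
    # Equal values look up the same rank, so ties share a rank and no rank is skipped.
    return [rank_of[value] for value in values]


scores = [100, 95, 100, 95, 90]
print(dense_rank_desc(scores))  # [1, 2, 1, 2, 3]
```

Both scores of 100 share rank 1, both scores of 95 share rank 2, and 90 gets rank 3 rather than 5.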

`row_number()`

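In contrast, the row_number() function assigns a unique, sequential integer (1, 2, 3, ...) to each row within a window, regardless of duplicate values. Tied rows still receive different numbers, and the order among them is not guaranteed unless the window ordering includes a tie-breaking column.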
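The numbering rule can be modelled in a few lines of plain Python. This is only a sketch of the semantics, not Spark's implementation: the helper name row_number_desc is hypothetical, and it breaks ties by input position, which is an assumption of the sketch (Spark itself makes no such promise).

```python
def row_number_desc(values):
    """Number the values 1..n by descending value; ties keep their input order."""
    # Sort positions by value, highest first. Python's sort is stable, so
    # positions holding equal values stay in their original relative order.
    order = sorted(range(len(values)), key=lambda i: -values[i])
    numbers = [0] * len(values)
    for number, position in enumerate(order, start=1):
        numbers[position] = number
    return numbers


scores = [100, 95, 100, 95, 90]
print(row_number_desc(scores))  # [1, 3, 2, 4, 5]
```

The two scores of 100 receive numbers 1 and 2, so every row ends up with a distinct number even though the values tie.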

Practical Examples

Let’s illustrate the difference between dense_rank() and row_number() with practical examples.

Example 1: Basic Usage

Suppose we have a PySpark DataFrame df with the following data:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("rank_example").getOrCreate()
data = [(1, "Sachin", 100),
        (2, "Mannu", 95),
        (3, "Raju", 100),
        (4, "David", 95),
        (5, "Justin", 90)]
columns = ["id", "name", "score"]
df = spark.createDataFrame(data, columns)

Using `dense_rank()`

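from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

# Order all rows by score, highest first. With no partitionBy, Spark moves
# every row into a single partition, which is fine for a small example.
window_spec = Window.orderBy(df["score"].desc())
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()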

Output:

+---+------+-----+----------+
| id|  name|score|dense_rank|
+---+------+-----+----------+
|  1|Sachin|  100|         1|
|  3|  Raju|  100|         1|
|  2| Mannu|   95|         2|
|  4| David|   95|         2|
|  5|Justin|   90|         3|
+---+------+-----+----------+

Using `row_number()`

from pyspark.sql.functions import row_number
# Reuses window_spec from above; rows with equal scores may be numbered in either order.
df.withColumn("row_number", row_number().over(window_spec)).show()

Output:

+---+------+-----+----------+
| id|  name|score|row_number|
+---+------+-----+----------+
|  1|Sachin|  100|         1|
|  3|  Raju|  100|         2|
|  2| Mannu|   95|         3|
|  4| David|   95|         4|
|  5|Justin|   90|         5|
+---+------+-----+----------+
