PySpark, the Python API for Apache Spark, offers a powerful set of functions for data manipulation and analysis. Two commonly used window functions for ranking data in PySpark are dense_rank() and row_number(). While they may seem similar at first glance, they treat tied values differently, so understanding the difference is crucial for precise data analytics. In this article, we will explore the distinctions between dense_rank() and row_number(), providing practical examples and detailed output comparisons.
Understanding dense_rank() and row_number()
dense_rank()
The dense_rank() function assigns a rank to each row within a window, following the window's ordering. Rows with equal ordering values receive the same rank, and no ranks are skipped: if two rows tie for rank 1, the next distinct value receives rank 2.
row_number()
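In contrast, the row_number() function assigns a unique, sequential integer, starting at 1, to each row within the window, regardless of duplicate values. Tied rows still receive different numbers, and the order among them is not guaranteed unless the window's ordering breaks the tie.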
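To make the two numbering rules concrete, here is a minimal pure-Python sketch that applies them to a list that is already sorted. The helper names dense_rank_sketch and row_number_sketch are hypothetical and exist only for illustration; they are not part of PySpark and only mimic the behavior described above.

```python
def dense_rank_sketch(sorted_values):
    """Hypothetical helper: ties share a rank and no rank is skipped."""
    ranks = []
    current_rank = 0
    previous = object()  # sentinel that compares unequal to every value
    for value in sorted_values:
        if value != previous:
            # A new distinct value moves the rank forward by exactly one.
            current_rank += 1
            previous = value
        ranks.append(current_rank)
    return ranks


def row_number_sketch(sorted_values):
    """Hypothetical helper: every row gets the next integer, ties included."""
    return list(range(1, len(sorted_values) + 1))


scores = [100, 100, 95, 95, 90]  # already sorted in descending order
print(dense_rank_sketch(scores))  # [1, 1, 2, 2, 3]
print(row_number_sketch(scores))  # [1, 2, 3, 4, 5]
```

The only difference between the two helpers is what happens when a value repeats: the dense rank stays put, while the row number keeps counting.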
Practical Examples
Let’s illustrate the difference between dense_rank() and row_number() with practical examples.
Example 1: Basic Usage
Suppose we have a PySpark DataFrame df with the following data:
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for the example
spark = SparkSession.builder.appName("rank_example").getOrCreate()

# Sample data: two pairs of rows share the same score
data = [(1, "Sachin", 100),
        (2, "Mannu", 95),
        (3, "Raju", 100),
        (4, "David", 95),
        (5, "Justin", 90)]
columns = ["id", "name", "score"]
df = spark.createDataFrame(data, columns)
Using dense_rank()
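from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

# Order the whole DataFrame by score, highest first.
# With no partitionBy, Spark moves all rows into a single partition,
# which is fine for a small example but costly on large data.
window_spec = Window.orderBy(df["score"].desc())
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()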
Output:
+---+------+-----+----------+
| id|  name|score|dense_rank|
+---+------+-----+----------+
|  1|Sachin|  100|         1|
|  3|  Raju|  100|         1|
|  2| Mannu|   95|         2|
|  4| David|   95|         2|
|  5|Justin|   90|         3|
+---+------+-----+----------+
Using row_number()
from pyspark.sql.functions import row_number
df.withColumn("row_number", row_number().over(window_spec)).show()
Output:
+---+------+-----+----------+
| id|  name|score|row_number|
+---+------+-----+----------+
|  1|Sachin|  100|         1|
|  3|  Raju|  100|         2|
|  2| Mannu|   95|         3|
|  4| David|   95|         4|
|  5|Justin|   90|         5|
+---+------+-----+----------+