In PySpark, dense_rank is a window function that assigns a rank to each row within a window partition, based on the ordering of one or more columns. Rows with equal values in the ordering column(s) receive the same rank, and the ranks are dense, meaning that there are no gaps in the sequence of rank values.
For example, if three rows tie for rank 2, all three are assigned rank 2 and the row with the next distinct value is assigned rank 3, one greater than the previous rank. By contrast, the rank function would skip ahead and assign that row rank 5. dense_rank is applied over a window specification with an ORDER BY clause (Window.orderBy in the DataFrame API), which sorts the rows by the column(s) used for ranking.
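The tie-handling rule can be illustrated without Spark. The following is a minimal pure-Python sketch, not PySpark's implementation; the helper names dense_rank_values and rank_values are hypothetical and exist only for this illustration. It contrasts dense ranking with standard (gapped) ranking on a list where three values tie.

```python
def dense_rank_values(values):
    """Dense rank of each value (ascending), returned in input order."""
    # Each distinct value gets its 1-based position among the sorted
    # distinct values, so consecutive ranks never skip a number.
    position = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [position[v] for v in values]


def rank_values(values):
    """Standard (gapped) rank: 1 + the number of strictly smaller values."""
    ordered = sorted(values)
    # ordered.index(v) is the index of the first occurrence of v,
    # which equals the count of values strictly smaller than v.
    return [ordered.index(v) + 1 for v in values]


ages = [22, 25, 25, 25, 30]
dense = dense_rank_values(ages)   # [1, 2, 2, 2, 3]
gapped = rank_values(ages)        # [1, 2, 2, 2, 5]
print("dense_rank:", dense)
print("rank:      ", gapped)
```

The three tied values share rank 2 in both schemes. Only the value that follows them differs: 3 under dense ranking, 5 under standard ranking.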
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import dense_rank

spark = SparkSession.builder.appName("dense_rank").getOrCreate()

data = [("Peter John", 25), ("Wisdon Mike", 30), ("Sarah Johns", 25), ("Bob Beliver", 22), ("Lucas Marget", 30)]
df = spark.createDataFrame(data, ["name", "age"])

# Group rows by age, then rank them by name within each age group.
window_spec = Window.partitionBy("age").orderBy("name")
df2 = df.select("name", "age", dense_rank().over(window_spec).alias("rank"))
df2.show()
+------------+---+----+
|        name|age|rank|
+------------+---+----+
| Bob Beliver| 22|   1|
|  Peter John| 25|   1|
| Sarah Johns| 25|   2|
|Lucas Marget| 30|   1|
| Wisdon Mike| 30|   2|
+------------+---+----+
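The window is partitioned by age and ordered by name, so ranking restarts at 1 for each age. Peter John and Sarah Johns are both 25 and fall into the same partition, where Peter John gets rank 1 and Sarah Johns gets rank 2 because "Peter John" sorts before "Sarah Johns". Names are unique within each partition, so there are no ties in this example and every rank within a partition is distinct.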