PySpark: Generate a sequence number based on a specific order of the DataFrame

PySpark @ Freshers.in

In PySpark, the row_number() function is one of the Window functions. Applied with an over() clause, it generates a sequence number that follows a specific ordering of the DataFrame: each row receives a unique, consecutive number starting at 1. Here is an example of how to use the row_number() function:

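from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number

# Reuse the active session (pyspark shell, notebook) or start a new one
spark = SparkSession.builder.appName("row_number_example").getOrCreate()

# Create a DataFrame
data = [
    ("Peter Sam", 11),
    ("Twinkle John", 23),
    ("Marrie Bob", 33),
    ("Sharone Rode", 43),
    ("Baby Jonnah", 24),
    ("Bobby Robert", 53),
    ("Shakewille Jane", 39),
]
df = spark.createDataFrame(data, ["name", "age"])
df.show()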
Output
+---------------+---+
|           name|age|
+---------------+---+
|      Peter Sam| 11|
|   Twinkle John| 23|
|     Marrie Bob| 33|
|   Sharone Rode| 43|
|    Baby Jonnah| 24|
|   Bobby Robert| 53|
|Shakewille Jane| 39|
+---------------+---+
# Create a Window specification
windowSpec = Window.partitionBy().orderBy("age")

# Add a new column with the row number
df = df.withColumn("row_number", row_number().over(windowSpec))

# Show the DataFrame
df.show()
Output
+---------------+---+----------+
|           name|age|row_number|
+---------------+---+----------+
|      Peter Sam| 11|         1|
|   Twinkle John| 23|         2|
|    Baby Jonnah| 24|         3|
|     Marrie Bob| 33|         4|
|Shakewille Jane| 39|         5|
|   Sharone Rode| 43|         6|
|   Bobby Robert| 53|         7|
+---------------+---+----------+

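Here, windowSpec partitions the DataFrame by nothing, so all rows fall into a single window, and orders them by the "age" column. The row_number() function is then applied to this window specification using the over() method, and the result is added as a new column called "row_number". Because no partitioning column is given, Spark moves all rows into a single partition to number them, which can be slow on large DataFrames. If two rows share the same age, the order between them is not guaranteed, so add a second ordering column when you need a stable result.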

You can also partition the DataFrame by one or more columns, so that the numbering restarts at 1 within each partition.

windowSpec = Window.partitionBy("name").orderBy("age")
df = df.withColumn("row_number", row_number().over(windowSpec))
df.show()
This gives the row number within each partition. Because every name in this DataFrame is unique, each partition holds exactly one row, so every row gets row number 1.
+---------------+---+----------+
|           name|age|row_number|
+---------------+---+----------+
|    Baby Jonnah| 24|         1|
|      Peter Sam| 11|         1|
|Shakewille Jane| 39|         1|
|   Twinkle John| 23|         1|
|     Marrie Bob| 33|         1|
|   Bobby Robert| 53|         1|
|   Sharone Rode| 43|         1|
+---------------+---+----------+
Author: user