PySpark: Generating a unique and increasing 64-bit integer ID for each row in a DataFrame

user January 29, 2023

pyspark.sql.functions.monotonically_increasing_id

A column that generates monotonically increasing 64-bit integers. The generated ID is guaranteed to be unique and monotonically increasing, but not consecutive. In the current implementation, the partition ID is stored in the upper 31 bits and the record number within each partition is stored in the lower 33 bits. This assumes that the DataFrame has fewer than 1 billion partitions, each with fewer than 8 billion records.
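To make the bit layout concrete, here is a minimal plain-Python sketch of how such an ID could be put together from a partition ID and a record number. It only mimics the layout described above and is not Spark's internal code; the names RECORD_BITS and compose_id are made up for this illustration.

```python
# Illustrative only: mimics the documented bit layout, not Spark's own implementation.
RECORD_BITS = 33  # the lower 33 bits hold the record number within a partition


def compose_id(partition_id: int, record_number: int) -> int:
    """Pack a partition ID (upper bits) and a record number (lower 33 bits) into one integer."""
    return (partition_id << RECORD_BITS) + record_number


# (partition_id, record_number) pairs chosen arbitrarily for the illustration
pairs = [(0, 0), (0, 1), (1, 0), (2, 5)]
ids = [compose_id(p, r) for p, r in pairs]
print(ids)  # [0, 1, 8589934592, 17179869189]
```

Within one partition the IDs go up by 1, and moving to the next partition jumps to the next multiple of 2^33 (8589934592). That is why the IDs are increasing and unique but not consecutive.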

In PySpark, you can add a sequence-like ID column to a DataFrame with the monotonically_increasing_id() function. It generates a unique and increasing 64-bit integer ID for each row, although the values are not guaranteed to be consecutive. The ID is unique within the DataFrame, but may not be unique across different DataFrames or sessions. Because the values depend on how the data is partitioned, they can change if the DataFrame is recomputed with a different partitioning.

Here is an example of how to use monotonically_increasing_id() to add a sequence number column to a DataFrame:

Sample Code

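from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

# Reuse the active SparkSession or start one (shells and notebooks usually predefine `spark`)
spark = SparkSession.builder.getOrCreate()

# Create a DataFrame
data = [("Peter Sam", 11), ("Twinkle John", 23), ("Marrie Bob", 33), ("Sharone Rode", 43)]
df = spark.createDataFrame(data, ["full_name", "age"])

# Add an ID column; the values are unique and increasing, but not consecutive
df = df.withColumn("seq_num", monotonically_increasing_id())

# Show the DataFrame
df.show()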

Result

+------------+---+-----------+
|   full_name|age|    seq_num|
+------------+---+-----------+
|   Peter Sam| 11|          0|
|Twinkle John| 23| 8589934592|
|  Marrie Bob| 33|17179869184|
|Sharone Rode| 43|25769803776|
+------------+---+-----------+
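The large gaps between the seq_num values are what you would expect if each of the four rows ended up in its own partition. As a rough check, the sketch below splits the values from the result above back into a partition ID and a record number, assuming the 31/33-bit layout described earlier. The helper split_id is a made-up name for this illustration, and the exact values you see will depend on how your own DataFrame is partitioned.

```python
# Illustrative only: decodes IDs under the assumed layout
# (upper 31 bits = partition ID, lower 33 bits = record number).
RECORD_BITS = 33
RECORD_MASK = (1 << RECORD_BITS) - 1  # keeps only the lower 33 bits


def split_id(row_id: int) -> tuple:
    """Return (partition_id, record_number) for a generated ID."""
    return row_id >> RECORD_BITS, row_id & RECORD_MASK


# seq_num values copied from the result table above
seq_nums = [0, 8589934592, 17179869184, 25769803776]
decoded = [split_id(s) for s in seq_nums]

for seq_num, (partition_id, record_number) in zip(seq_nums, decoded):
    print(f"{seq_num:>11} -> partition {partition_id}, record {record_number}")
```

Every value decodes to record number 0 in partitions 0 through 3, which matches the step of 8589934592 (2^33) between consecutive rows.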
