In this article, we will explore the use of zipWithUniqueId in PySpark, a method that assigns a unique identifier to each element in an RDD. We will provide a detailed example using hardcoded values as input.
Prerequisites
- Python 3.7 or higher
- PySpark library
- Java 8 or higher
First, let’s create a PySpark RDD from a hardcoded list of country names:
#Using zipWithUniqueId in PySpark at Freshers.in
from pyspark import SparkContext
sc = SparkContext("local", "zipWithUniqueId @ Freshers.in")
data = ["America", "Botswana", "Costa Rica", "Denmark", "Egypt"]
rdd = sc.parallelize(data)
Using zipWithUniqueId
Now, let’s use the zipWithUniqueId method to assign a unique identifier to each element in the RDD:
unique_id_rdd = rdd.zipWithUniqueId()
unique_id_data = unique_id_rdd.collect()
print("Data with Unique IDs:")
for element in unique_id_data:
    print(element)
In this example, we called the zipWithUniqueId method on the RDD, which creates a new RDD of tuples pairing each original element with a unique identifier. The collect method then brings the results back to the driver. Note that, unlike zipWithIndex, zipWithUniqueId does not trigger a Spark job, and the IDs it assigns are guaranteed to be unique but not necessarily consecutive.
Interpreting the Results
Data with Unique IDs:
('America', 0)
('Botswana', 1)
('Costa Rica', 2)
('Denmark', 3)
('Egypt', 4)
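The IDs come out consecutive here because the RDD was created with a single local partition. In general, zipWithUniqueId gives the i-th element of partition k (out of n partitions) the ID i * n + k, so across multiple partitions the IDs are unique but interleaved rather than sequential. A minimal pure-Python sketch of that scheme (the function name and partition layout below are illustrative, not part of the PySpark API):

```python
def zip_with_unique_id(partitions):
    # Mimics RDD.zipWithUniqueId: with n partitions, the i-th item
    # of partition k receives the ID i * n + k.
    n = len(partitions)
    result = []
    for k, part in enumerate(partitions):
        for i, item in enumerate(part):
            result.append((item, i * n + k))
    return result

# Same data split across two partitions: IDs stay unique but interleave.
parts = [["America", "Botswana", "Costa Rica"], ["Denmark", "Egypt"]]
print(zip_with_unique_id(parts))
```

With a single partition (n = 1), the formula reduces to i * 1 + 0 = i, which is exactly the consecutive 0–4 sequence shown in the output above. If you need guaranteed consecutive indices regardless of partitioning, zipWithIndex is the appropriate method, at the cost of an extra Spark job.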