PySpark: Remove any key-value pair that has a key present in another RDD [subtractByKey]

PySpark @ Freshers.in

In this article, we will explore the use of subtractByKey in PySpark, a transformation that returns the key-value pairs of one RDD after removing every pair whose key appears in another RDD. We will walk through a complete example using hardcoded values as input.

First, let’s create two PySpark RDDs:

#Using subtractByKey in PySpark @Freshers.in
from pyspark import SparkContext
sc = SparkContext("local", "subtractByKey @ Freshers.in ")
data1 = [("America", 1), ("Botswana", 2), ("Costa Rica", 3), ("Denmark", 4), ("Egypt", 5)]
data2 = [("Botswana", 20), ("Denmark", 40), ("Finland", 60)]

rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)

Using subtractByKey

Now, let’s use the subtractByKey method to create a new RDD by removing key-value pairs from rdd1 that have keys present in rdd2:

result_rdd = rdd1.subtractByKey(rdd2)
result_data = result_rdd.collect()
print("Result of subtractByKey:")
for element in result_data:
    print(element)

In this example, we used the subtractByKey method on rdd1 and passed rdd2 as an argument. The method returns a new RDD containing key-value pairs from rdd1 after removing any pair with a key present in rdd2. The collect method is then used to retrieve the results.

Interpreting the Results

Result of subtractByKey:
('Costa Rica', 3)
('America', 1)
('Egypt', 5)

The resulting RDD contains the key-value pairs from rdd1 whose keys do not appear in rdd2: the pairs with keys “Botswana” and “Denmark” are removed. Note that collect does not guarantee any particular ordering, so your output may list the surviving pairs in a different order.

In this article, we explored the use of subtractByKey in PySpark, showing how to create two RDDs of key-value pairs, apply the transformation, and interpret the results. subtractByKey is useful in a variety of scenarios, such as filtering out unwanted data by key or performing set-difference-style operations on key-value pair RDDs.

Spark important urls to refer

  1. Spark Examples
  2. PySpark Blogs
  3. Bigdata Blogs
  4. Spark Interview Questions
  5. Official Page