PySpark: Remove any key-value pair that has a key present in another RDD [subtractByKey]

PySpark @ Freshers.in

In this article, we will explore the use of subtractByKey in PySpark, a transformation that returns the key-value pairs of one RDD after removing every pair whose key appears in another RDD. We will walk through a complete example using hardcoded values as input.

First, let’s create two PySpark RDDs:

#Using subtractByKey in PySpark @Freshers.in
from pyspark import SparkContext
sc = SparkContext("local", "subtractByKey @ Freshers.in ")
data1 = [("America", 1), ("Botswana", 2), ("Costa Rica", 3), ("Denmark", 4), ("Egypt", 5)]
data2 = [("Botswana", 20), ("Denmark", 40), ("Finland", 60)]

rdd1 = sc.parallelize(data1)
rdd2 = sc.parallelize(data2)

Using subtractByKey

Now, let’s use the subtractByKey method to create a new RDD by removing key-value pairs from rdd1 that have keys present in rdd2:

result_rdd = rdd1.subtractByKey(rdd2)
result_data = result_rdd.collect()
print("Result of subtractByKey:")
for element in result_data:
    print(element)

In this example, we used the subtractByKey method on rdd1 and passed rdd2 as an argument. The method returns a new RDD containing key-value pairs from rdd1 after removing any pair with a key present in rdd2. The collect method is then used to retrieve the results.

Interpreting the Results

Result of subtractByKey:
('Costa Rica', 3)
('America', 1)
('Egypt', 5)

The resulting RDD contains the key-value pairs from rdd1 whose keys do not appear in rdd2: the pairs with keys “Botswana” and “Denmark” are removed. Note that collect does not guarantee any particular ordering, so your output may list the surviving pairs in a different order.

In this article, we explored the use of subtractByKey in PySpark, showing how to create two RDDs of key-value pairs, apply the transformation, and interpret the results. subtractByKey is useful in a variety of scenarios, such as filtering out unwanted data by key or performing set-difference-style operations on key-value pair RDDs.

Spark important urls to refer

  1. Spark Examples
  2. PySpark Blogs
  3. Bigdata Blogs
  4. Spark Interview Questions
  5. Official Page