‘map’ in PySpark is a transformation operation that applies a function to each element of an RDD (Resilient Distributed Dataset), the fundamental data structure in PySpark. The function takes a single element as input and returns a single output.
The result of the map operation is a new RDD where each element is the result of applying the function to the corresponding element in the original RDD.
Example:
Suppose you have an RDD of integers, and you want to multiply each element by 2. You can use the map transformation as follows:
from pyspark import SparkContext

sc = SparkContext.getOrCreate()    # reuse the active SparkContext if one exists
rdd = sc.parallelize([1, 2, 3, 4, 5])   # distribute a local list as an RDD
result = rdd.map(lambda x: x * 2)       # transformation: lazily doubles each element
result.collect()                        # action: runs the job, returns a list to the driver
The output of this code is [2, 4, 6, 8, 10]. The map operation takes a lambda (or any other function) that accepts a single integer and returns its double. Because map is a transformation, it is evaluated lazily: nothing actually runs until an action such as collect is called, which triggers the computation and returns the elements of the RDD to the driver program as a list.
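Since map applies the function independently to each element, the output type does not have to match the input type. As a minimal illustrative sketch (the variable names here are hypothetical), the following map turns an RDD of strings into (word, length) tuples:

words = sc.parallelize(["spark", "map", "rdd"])
pairs = words.map(lambda w: (w, len(w)))   # each string becomes a (word, length) tuple
pairs.collect()                            # [('spark', 5), ('map', 3), ('rdd', 3)]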