In PySpark, a broadcast variable caches a read-only value on every worker node so that tasks running on that node can read it locally, rather than having a copy shipped along with the tasks. This can improve the performance of Spark jobs by reducing the amount of data that needs to be sent over the network.
Here is an example of how to use broadcast variables in PySpark:
```python
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Broadcast Example").getOrCreate()

# Create an RDD
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Create a broadcast variable
broadcastVar = spark.sparkContext.broadcast(10)

# Use the broadcast variable in a task
def addBroadcast(x):
    return x + broadcastVar.value

# Perform the transformation and collect the results
results = rdd.map(addBroadcast).collect()
print(results)
```
In this example, we create an RDD with some data, and then create a broadcast variable called `broadcastVar` with the value 10. We then define a function `addBroadcast` that takes an input value and adds the value of the broadcast variable to it. Finally, we use the `map` transformation to apply the `addBroadcast` function to each element of the RDD, and then use the `collect` action to retrieve the results. The output of the example is [11, 12, 13, 14, 15], showing that the broadcast variable was used in the task to add 10 to each element of the RDD.
Broadcast variables are useful when you have a large read-only value that is used by many tasks. Because the value is cached on the worker nodes, it does not have to be shipped over the network again each time it is needed. Additionally, Spark cleans up a broadcast variable automatically once it is no longer referenced, and you can release it explicitly by calling unpersist() or destroy() on it.
Note that a broadcast variable is created on the driver node and then sent to all the worker nodes. For a large value this can consume a significant amount of network bandwidth, as well as memory on the driver and on every worker, so broadcast variables should be used judiciously.