When you submit a Spark application using the `spark-submit` command, a series of steps occurs to start and execute the application. Here is an overview of the main steps that happen after you submit a Spark application:
- Cluster Manager: The first step is to determine which cluster manager will be used to execute the application. Spark supports multiple cluster managers, such as YARN, Mesos, and the standalone Spark cluster manager. Depending on the configuration specified in the `spark-submit` command or the application's properties, the appropriate cluster manager is selected.
- Driver Program: The driver program, which defines the application's main logic and creates the SparkContext, is launched on the cluster (or on the client machine, depending on the deploy mode). The driver is responsible for creating the RDDs and DataFrames, applying transformations and actions, and coordinating the execution of the tasks on the cluster. A minimal driver sketch appears after this list.
- Task Scheduling: The driver program communicates with the cluster manager to schedule tasks on the cluster's worker nodes. The cluster manager allocates the resources (executors), and the driver's scheduler assigns tasks to them.
- Task Execution: The tasks are executed by executors on the worker nodes. Each task operates on a partition of the data and applies the specified transformations and actions.
- Shuffle: If a transformation that requires data shuffling, such as `groupByKey` or `join`, is called, the driver program coordinates the execution of the shuffle step. The shuffle redistributes the data across the worker nodes so that it is properly partitioned for the next stage of the computation (see the sketch after this list).
- Results and Error Handling: Once the tasks have completed, the driver program collects the results and determines whether any errors occurred during execution. If there are errors, the driver handles them accordingly, for example by retrying failed tasks or stopping the application if the error is unrecoverable.
- Application Completion: After the application completes, the driver program exits and the cluster manager releases the resources that were allocated to the application.
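To make the first few steps concrete, here is a minimal PySpark sketch of a driver program together with an illustrative `spark-submit` invocation. The file name, input path, and resource flags are examples, not recommendations, and would need to be adapted to your cluster.

```python
# wordcount_app.py -- a minimal driver program (file and path names are illustrative).
#
# Submitted with something like (flag values are examples):
#   spark-submit --master yarn --deploy-mode cluster \
#                --num-executors 4 --executor-memory 2g \
#                --conf spark.task.maxFailures=4 \
#                wordcount_app.py
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # The driver creates the SparkSession/SparkContext; the --master flag
    # (or the spark.master property) determines which cluster manager is used.
    spark = SparkSession.builder.appName("WordCountExample").getOrCreate()
    sc = spark.sparkContext

    # Transformations only build up the lineage; nothing executes yet.
    lines = sc.textFile("hdfs:///tmp/input.txt")  # hypothetical input path
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # The action triggers scheduling: the driver obtains executors from the
    # cluster manager and schedules tasks on them, one task per partition.
    for word, count in counts.take(10):
        print(word, count)

    # On exit, the cluster manager releases the application's resources.
    spark.stop()
```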
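The shuffle step can likewise be illustrated with a small, hypothetical key-value dataset: `groupByKey` and `join` both force data to be redistributed so that matching keys end up in the same partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ShuffleExample").getOrCreate()
sc = spark.sparkContext

# A tiny, made-up key-value dataset spread across 4 partitions.
pairs = sc.parallelize(
    [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)], numSlices=4)

# groupByKey forces a shuffle: records with the same key must end up in the
# same partition, so data is redistributed across the worker nodes.
grouped = pairs.groupByKey()

# join likewise shuffles both sides so that matching keys are co-located.
other = sc.parallelize([("a", "x"), ("b", "y")])
joined = pairs.join(other)

print(sorted((k, sorted(v)) for k, v in grouped.collect()))
print(sorted(joined.collect()))

spark.stop()
```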
There are many other things happening in the background as well, such as task serialization and deserialization, caching, and data-locality-aware scheduling.
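Caching is one of these mechanisms you can control directly. The following is a minimal sketch, assuming a small RDD that is reused by two actions; the names and sizes are purely illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CachingExample").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))
evens = numbers.filter(lambda n: n % 2 == 0).cache()  # mark for caching

# The first action computes and materializes the cached partitions;
# the second action reuses them instead of recomputing the filter.
print(evens.count())
print(evens.take(5))

spark.stop()
```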
This process can vary depending on the specific cluster manager being used, as well as the configuration and settings of the Spark application and cluster.
Important Spark URLs to refer to