In this article we will discuss how to trigger a PySpark job running on AWS EMR from Airflow hosted on a different instance. For the initial set-up you need to create a permanent connection between the Airflow server and the EMR cluster (where your Spark jobs run). You can achieve this permanent connection by placing the Airflow server's public SSH key in the ~/.ssh/authorized_keys file on the EMR master node.
How to generate the Key
For 2048-bit RSA:
ssh-keygen -t rsa -b 2048 -C "<comment>"
For ED25519:
ssh-keygen -t ed25519 -C "<comment>"
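Once the key pair exists on the Airflow server, its public half has to land on the EMR master node. Below is a minimal sketch of that step; the host name, the EMR .pem key path and the hadoop user are assumptions, so substitute your own values.

# On the Airflow server: generate the key pair without a passphrase so Airflow can use it non-interactively
ssh-keygen -t ed25519 -C "airflow-to-emr" -f ~/.ssh/id_ed25519 -N ""

# Append the public key to the EMR master node's authorized_keys
# (EMR normally disables password login, so the cluster's .pem key is used for this one-time copy)
cat ~/.ssh/id_ed25519.pub | ssh -i ~/emr-keypair.pem hadoop@<emr-master-public-dns> "cat >> ~/.ssh/authorized_keys"

# Verify that passwordless SSH now works
ssh -i ~/.ssh/id_ed25519 hadoop@<emr-master-public-dns> "echo connection OK"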
Once the connection is successfully established you can use the BashOperator to trigger the Spark job. The first question that comes to mind is whether we can get the Spark logs in Airflow. The answer is "Yes", we can, using the method below.
To execute this we need the following:
- Connection between the Airflow server and EMR => established above
- Spark code on EMR
- Shell script on the Airflow server to run the Spark code
- Airflow DAG (Python code) to run the shell script
Spark code on the EMR server (file name: bash_freshers_sample_spark.py)
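The original listing is not reproduced here, so the following is only a minimal sketch of what bash_freshers_sample_spark.py could contain; the toy DataFrame and column names are placeholders for your real job logic.

# Minimal PySpark job; replace the toy DataFrame with your real logic.
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("freshers_sample_spark").getOrCreate()

    # A tiny DataFrame so the job has something to do.
    data = [("alice", 1), ("bob", 2), ("carol", 3)]
    df = spark.createDataFrame(data, ["name", "value"])

    # Anything printed here appears in the spark-submit output, which the
    # shell script below (and therefore the Airflow task log) captures.
    print("Row count:", df.count())
    df.show()

    spark.stop()

if __name__ == "__main__":
    main()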
Shell script on the Airflow server to run the Spark code (file name: freshers_in_spark.sh)
Note that the whole ssh … spark-submit … bash_freshers_sample_spark.py command must be issued as a single line; in the original listing it was split across three lines only for readability.
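As a sketch, freshers_in_spark.sh could look like the script below; the key path, the hadoop user, the EMR host name and the script location on EMR are assumptions.

#!/bin/bash
# Run the PySpark job on the EMR master node over SSH. The ssh ... spark-submit ...
# command is one single line. 2>&1 merges stderr into stdout so the Spark driver
# logs are captured by Airflow's task log.
ssh -i ~/.ssh/id_ed25519 -o StrictHostKeyChecking=no hadoop@<emr-master-public-dns> "spark-submit --deploy-mode client /home/hadoop/bash_freshers_sample_spark.py" 2>&1

# Propagate spark-submit's exit code so the Airflow task fails when the Spark job fails.
exit $?

Running in client deploy mode keeps the driver on the master node, so its logs stream back over the SSH session and end up in the Airflow task log.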
Airflow DAG (Python code) on the Airflow server to run the shell script (file name: freshers_spark_job.py)
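Again as a sketch (the schedule, default_args and script path are assumptions, and the import path assumes Airflow 2.x), freshers_spark_job.py could look like this:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="freshers_spark_job",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # trigger manually, or set a cron expression
    catchup=False,
) as dag:

    # The trailing space after the .sh path matters: without it the BashOperator
    # tries to render the file as a Jinja template and the task fails.
    run_spark_job = BashOperator(
        task_id="run_freshers_spark_job",
        bash_command="/home/airflow/scripts/freshers_in_spark.sh ",
    )

Because the BashOperator writes the script's stdout to the task log, the spark-submit output (and with it the Spark driver logs) shows up directly in the Airflow UI, which answers the question raised earlier.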