To connect PySpark to Google BigQuery, you need the spark-bigquery-connector. The connector is distributed as a JAR (on Maven Central and Google Cloud Storage), not as a Google Cloud SDK component, so on AWS EMR the usual approach is to hand it to Spark at launch time with the --packages or --jars flag of spark-submit.
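For example, using Maven coordinates (the version shown is illustrative, so pick the release that matches your cluster's Spark and Scala versions; your_job.py is a placeholder for your script):

spark-submit \
    --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2 \
    your_job.py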
To connect to BigQuery from AWS EMR, you need to set up authentication with a service account, since an EMR cluster runs outside Google Cloud and has no default Google credentials. A service account is a special type of Google account that belongs to your application or a virtual machine (VM) rather than to an individual end user.
To set up authentication, create a new service account, grant it the permissions it needs on BigQuery (for read-only access, the BigQuery Data Viewer and BigQuery Read Session User roles are typically enough), and download a JSON key file for the account. The key file must be readable on the cluster, so copy it to the nodes or distribute it with spark-submit --files.
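With the gcloud CLI, the setup looks roughly like this (a sketch: the account name spark-bq-reader is a placeholder, and the roles assume read-only access):

gcloud iam service-accounts create spark-bq-reader
gcloud projects add-iam-policy-binding <project-id> \
    --member="serviceAccount:spark-bq-reader@<project-id>.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataViewer"
gcloud projects add-iam-policy-binding <project-id> \
    --member="serviceAccount:spark-bq-reader@<project-id>.iam.gserviceaccount.com" \
    --role="roles/bigquery.readSessionUser"
gcloud iam service-accounts keys create key.json \
    --iam-account=spark-bq-reader@<project-id>.iam.gserviceaccount.com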
With the connector on the classpath and the key file in place, you can use the spark-bigquery-connector to read and write data in BigQuery. Here is an example of reading a BigQuery table into a PySpark DataFrame:
Sample code
from pyspark.sql import SparkSession

# Point the BigQuery connector at the service account's JSON key file.
spark = SparkSession.builder \
    .appName("BigQuery") \
    .config("credentialsFile", "<path-to-key-file>") \
    .getOrCreate()

# Fully qualified table name: project, dataset, and table.
table_id = "<project-id>.<dataset-id>.<table-id>"
df = spark.read.format("bigquery").option("table", table_id).load()
Replace <path-to-key-file> with the path to the service account's JSON key file, and <project-id>.<dataset-id>.<table-id> with the values for your BigQuery table. As an alternative to the credentialsFile setting, you can export GOOGLE_APPLICATION_CREDENTIALS=<path-to-key-file> in the environment of the Spark processes; the connector falls back to application default credentials.
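Writing follows the same pattern, with one extra step: by default the connector stages the data in Google Cloud Storage before loading it into BigQuery, so you must name a staging bucket. A minimal sketch, assuming the service account can also write to the target dataset and to the bucket (<staging-bucket> and <output-table> are placeholders):

# Append the DataFrame to a BigQuery table via a GCS staging bucket.
df.write.format("bigquery") \
    .option("temporaryGcsBucket", "<staging-bucket>") \
    .mode("append") \
    .save("<project-id>.<dataset-id>.<output-table>")

Newer connector versions can also write directly over the BigQuery Storage Write API (.option("writeMethod", "direct")), which avoids the staging bucket.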