Google BigQuery’s prowess as a data warehouse is well-established. But its ability to seamlessly integrate with external data sources makes it a flexible and powerful tool for data professionals. Whether it’s a spreadsheet on Google Sheets or files in Cloud Storage, BigQuery can pull, process, and provide insights with ease. Integration capabilities set BigQuery apart as more than just a data warehouse. By connecting with external sources, analysts can unify, query, and generate insights from data spread across various Google platforms. With federated querying at your fingertips, the boundaries for data analytics are truly expansive.
This article will delve into BigQuery’s capabilities to connect with external data sources, guide you on setting up these connections, and show you how to write queries that span across your internal and external datasets.
Overview of BigQuery’s External Data Capabilities
BigQuery offers a feature called “federated querying” which allows users to analyze data across different Google services without needing to load them into BigQuery first. Supported external sources include:
1. Google Cloud Storage
2. Google Drive (including Google Sheets)
3. Google Cloud Bigtable
4. Google Cloud SQL
Setting Up Connections to External Data Sources
Share the Sheet with the service account tied to your BigQuery project.
Get the Google Sheets URL and extract the ID part from it.
Create an external table in BigQuery using this ID.
Google Cloud Storage:
Ensure that the BigQuery service account has the necessary permissions (like storage.objects.get) for the Cloud Storage bucket.
Create an external table, specifying the Cloud Storage URI as the data location.
Other sources like Cloud SQL and Bigtable:
These involve more specific configurations based on the nature of the source, but the general principle is to establish authenticated access and then define external tables in BigQuery pointing to these sources.
Let’s demonstrate with a hypothetical Google Sheets connection:
Assuming you have a Google Sheet with an ID YOUR_GOOGLE_SHEET_ID and the Sheet has data in the format:
+------+-------+ | Name | Score | +------+-------+ | John | 90 | | Jane | 85 | +------+-------+
You can create an external table in BigQuery:
CREATE EXTERNAL TABLE my_dataset.my_external_table OPTIONS ( format="GOOGLE_SHEETS", uri="https://drive.google.com/open?id=YOUR_GOOGLE_SHEET_ID" ) AS SELECT * FROM my_dataset.my_internal_table;
Then, if you have an internal table with more student scores:
CREATE TABLE my_dataset.my_internal_table AS SELECT "Alex" AS Name, 88 AS Score UNION ALL SELECT "Alice", 92;
You can combine data from both:
SELECT Name, AVG(Score) as Average_Score FROM (SELECT * FROM my_dataset.my_external_table UNION ALL SELECT * FROM my_dataset.my_internal_table) GROUP BY Name;
+-------+--------------+ | Name | Average_Score| +-------+--------------+ | John | 90 | | Jane | 85 | | Alex | 88 | | Alice | 92 | +-------+--------------+