Often in data analytics, there’s a need to extract specific parts of a date or timestamp for more granular analysis. One such common operation is extracting minutes from a timestamp. This tutorial focuses on achieving this in Google BigQuery and processing the results with PySpark.
Step 1: Setting up the Table in BigQuery
We’ll first create a table named freshers_in_page_views in BigQuery to hold our sample page-view data.
CREATE TABLE freshers_in_page_views (
view_id INT64,
view_timestamp TIMESTAMP,
view_count INT64
);
Step 2: Inserting Sample Data
Let’s insert some rows of sample data for demonstration.
Insert Statement:
INSERT INTO freshers_in_page_views (view_id, view_timestamp, view_count)
VALUES
(1, TIMESTAMP "2023-09-14 15:43:22 UTC", 120),
(2, TIMESTAMP "2023-09-14 12:05:45 UTC", 89),
(3, TIMESTAMP "2023-09-14 09:15:00 UTC", 155);
Step 3: Extracting Minutes from Timestamp in BigQuery
Using the EXTRACT function in BigQuery, we can retrieve the minutes of our timestamps.
Select Statement:
SELECT
view_id,
EXTRACT(MINUTE FROM view_timestamp) AS minute_of_hour,
view_count
FROM freshers_in_page_views;
Output
view_id | minute_of_hour | view_count
------- | -------------- | ----------
1 | 43 | 120
2 | 5 | 89
3 | 15 | 155
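As a quick sanity check, Python's datetime behaves the same way: the minute attribute of a timestamp corresponds to BigQuery's EXTRACT(MINUTE FROM ...). A minimal sketch using the same three timestamps inserted above:

```python
from datetime import datetime, timezone

# The three timestamps inserted into freshers_in_page_views above
timestamps = [
    datetime(2023, 9, 14, 15, 43, 22, tzinfo=timezone.utc),
    datetime(2023, 9, 14, 12, 5, 45, tzinfo=timezone.utc),
    datetime(2023, 9, 14, 9, 15, 0, tzinfo=timezone.utc),
]

# .minute mirrors EXTRACT(MINUTE FROM view_timestamp) in BigQuery
minutes = [ts.minute for ts in timestamps]
print(minutes)  # [43, 5, 15]
```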
Step 4: Processing the Result in PySpark
Once you’ve exported the results from BigQuery (for example, as a CSV file), you can use PySpark for further processing.
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("bigquery-pyspark @ Learning at Freshers.in").getOrCreate()
# Load the data (assuming a CSV export from BigQuery)
df = spark.read.csv("path_to_your_exported_file.csv", header=True, inferSchema=True)
# Displaying the dataframe with minutes extracted
df.show()