Transforming data in a way that is both accurate and meaningful can sometimes involve complex logic, particularly when it comes to joins and aggregations. DBT (Data Build Tool) is an excellent resource that allows data analysts and engineers to perform these transformations using familiar SQL syntax while maintaining the integrity and consistency of the data. This article aims to shed light on how you can leverage DBT to handle complex transformation logic involving joins and aggregations.
Understanding complex transformations
Complex transformations typically involve operations like joining multiple tables, filtering and aggregating data, and sometimes nested queries. These can quickly become hard to manage if not handled correctly. With DBT, you can break down these complex transformations into smaller, more manageable steps through the use of intermediate models.
Creating base and Intermediate models
Before diving into the complexities of joins and aggregations, start by creating base models. Base models represent the raw data in your source tables, with some initial cleansing, if necessary. Let’s consider we have an ‘orders’ and ‘users’ table:
-- freshers_in_base_orders.sql
{{
config(
materialized='view'
)
}}
SELECT *
FROM {{ source('raw_data', 'freshers_in_orders') }}
-- freshers_in_base_users.sql
{{
config(
materialized='view'
)
}}
SELECT *
FROM {{ source('raw_data', 'freshers_in_users') }}
After setting up base models, you can create staging or intermediate models. These models perform additional transformations, such as filtering and deriving new columns.
Handling joins
To join multiple tables, you can create an intermediate model that references your base models:
-- freshers_in_stg_orders.sql
{{
config(
materialized='table'
)
}}
SELECT
o.order_id,
o.order_date,
o.amount,
u.user_id,
u.username
FROM {{ ref('freshers_in_base_orders') }} o
LEFT JOIN {{ ref('freshers_in_base_users') }} u
ON o.user_id = u.user_id
This example uses a LEFT JOIN to combine the orders and users tables. DBT allows you to perform INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN, and CROSS JOIN operations as per your business requirements.
Handling aggregations
When performing aggregations, it’s often beneficial to create another intermediate model that performs the aggregation logic. This decoupling allows you to reuse the join logic for other transformations and tests. For example, you can create an intermediate model that calculates the total sales per user:
-- freshers_in_agg_sales.sql
{{
config(
materialized='table'
)
}}
SELECT
user_id,
SUM(amount) as total_sales
FROM {{ ref('freshers_in_stg_orders') }}
GROUP BY user_id
Managing complex aggregations with window functions
For more complex aggregations, such as calculating running totals or averages, you can leverage window functions. Window functions perform a calculation across a set of rows related to the current row, making them ideal for these types of tasks.
-- freshers_in_agg_running_total.sql
{{
config(
materialized='table'
)
}}
SELECT
user_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) as running_total
FROM {{ ref('freshers_in_stg_orders') }}
The window function calculates a running total of sales for each user, ordered by the date of the order.
Testing transformations
As you deal with complex transformations, it’s critical to continually test your data to ensure its accuracy and consistency. DBT offers several in-built tests like uniqueness, referential integrity, and not-null checks.
models:
my_project:
freshers_in_agg_sales:
constraints:
unique:
- user_id
not_null:
- user_id
- total_sales
Get more useful articles on dbt