One of its key features of DBT is the ability to capture historical changes in data over time using snapshots. This article will provide a detailed explanation of how to use DBT snapshots to track historical changes in data, along with a practical example.
What are DBT Snapshots?
DBT snapshots are a mechanism for creating a historical record of your data by capturing changes over time. They store a copy of your data at specific intervals, allowing you to analyze trends and detect patterns in your data that may not be evident in the most recent version.
How to Create a DBT Snapshot ?
To create a snapshot, you need to complete the following steps:
- Configure your dbt_project.yml file
- Create a snapshot configuration file
- Define the snapshot strategy
- Run the snapshot command
Step 1: Configure your dbt_project.yml file
Start by adding a ‘snapshots’ configuration to your dbt_project.yml file, specifying the location where your snapshot files will be stored. For example:
name: my_project
version: 1.0.0
config-version: 2
...
snapshots:
- path: snapshots
Step 2: Create a snapshot configuration file
Next, create a new file in the ‘snapshots’ directory with a .sql
extension. This file will define the SQL query for your snapshot and include the snapshot configuration.
For our example, let’s assume we want to track changes in a table called ‘orders’ that has the columns ‘id’, ‘status’, and ‘updated_at’. Create a file named ‘orders_snapshot.sql’ with the following content:
{% snapshot orders_snapshot %}
SELECT id, status, updated_at
FROM {{ source('my_project', 'orders') }}
{% endsnapshot %}
Step 3: Define the snapshot strategy
DBT supports two snapshot strategies: ‘timestamp’ and ‘check’. In our example, we’ll use the ‘timestamp’ strategy, which captures changes based on a timestamp column in the source data.
To define the snapshot strategy, add a snapshots.yml
file in the ‘snapshots’ directory with the following content:
version: 1
snapshots:
- name: orders_snapshot
strategy: timestamp
updated_at: updated_at
target_schema: orders_history
unique_key: id
This configuration tells DBT to use the ‘timestamp’ strategy based on the ‘updated_at’ column, store the historical data in a schema called ‘orders_history’, and use the ‘id’ column as the unique key to identify records.
Step 4: Run the snapshot command
Finally, run the snapshot command to capture the current state of your data:
dbt snapshot
This command will create a snapshot of your ‘orders’ table in the ‘orders_history’ schema, capturing the current state of your data. You can then run the command periodically to update the snapshot with new changes.
Example: Analyzing Historical Changes
With the snapshot in place, you can now analyze historical changes in your data. For instance, to find the count of orders with a specific status over time, you can write a query like this:
SELECT
status,
DATE_TRUNC('month', updated_at) AS month,
COUNT(*) AS order_count
FROM orders_history
GROUP BY status, month
ORDER BY month, status;
This query will provide a monthly breakdown of order counts by status, allowing you to analyze trends and changes over time.
DBT snapshots provide a powerful way to track historical changes in your data, enabling you to detect trends and patterns that may not be visible in the most recent version of the data. By configuring your dbt_project.yml file, creating a snapshot configuration file, defining a snapshot strategy, and running the snapshot command, you can capture and analyze changes in your data over time.
In our example, we demonstrated how to create a DBT snapshot for an ‘orders’ table and analyze the count of orders by status on a monthly basis. You can apply the same principles to your own data sets and use cases, enabling you to gain valuable insights into your data and make more informed decisions.
Get more useful articles on dbt