In DBT’s world, the terminology can seem a bit mystifying to newcomers, especially when we talk about “sources” and “seeds”. This article aims to break down these concepts, allowing you to better understand their roles within your DBT project.
What are Sources?
Sources in DBT represent raw data from your original data sources, usually databases, APIs, or files. They are the initial data inputs to your DBT project and are described in DBT’s YAML syntax. Sources provide DBT with metadata about these data sources, helping you maintain control over and awareness of the starting point of your transformations. Sources and seeds are two foundational concepts in DBT that serve as the starting point of your transformations. Understanding
To define a source, you need to create a source file under the /sources directory of your DBT project. The following is a simple example of a source file:
version: 2
sources:
- name: freshers_source_database
database: freshers_database_name
schema: freshers_schema_name
tables:
- name: customers
- name: orders
In this example, freshers_source_database is a logical grouping for your raw tables, customers and orders, which are stored in freshers_database_name and freshers_schema_name.
Defining sources in DBT comes with several benefits:
Documentation: Sources document the raw data in your project and the initial state of your transformations.
Testing: DBT allows you to write tests against your sources, helping ensure the quality of your raw data.
Lineage: DBT uses sources to create a lineage graph that visually shows the relationship between your models and the raw data they use.
What are Seeds?
Seeds in DBT are files containing data that you want to load into your data warehouse. Usually, seeds are small tables in CSV format stored in your DBT project repository. They can contain information like category mappings, lists of regions, or other static data that your transformations require.
To use seeds, you place your CSV file in the /seeds directory of your DBT project. Then, you can use the dbt seed command to load this data into your data warehouse.
Here’s an example of how a seed file might look:
region_id,region_name
1,KERALA
2,GUJARAT
3,KARNATAKA
You can use this in your DBT models like so:
SELECT
customers.*,
region_mapping.region_name
FROM {{ ref('customers') }} AS customers
LEFT JOIN {{ ref('region_mapping') }} AS region_mapping
ON customers.region_id = region_mapping.region_id
In this SQL code, region_mapping is the table created from your seed file and is used to add region names to your customers table.
Seeds in DBT offer several advantages:
Simple data loading: With the dbt seed command, you can easily load your CSV data into your data warehouse.
Version control: Since seed files are stored in your DBT project repository, they are version-controlled.
Transformation use: Seeds can be used in your transformations just like any other table in your data warehouse.
Get more useful articles on dbt