What is Seeds in dbt? How to load CSV files into your data warehouse using the dbt ?

getDbt

Seeds are the CSV files in your dbt project. There will be a seed director , that dbt can load into your data warehouse using the dbt seed command. Seeds can be referenced in the downstream models the same way as referencing models, using the ref function. Seeds are best suited to static data which changes infrequently as they are version controlled and code reviewable.

  1. Seeds: In dbt, a “seed” is a specific type of data source or table that contains static data. This data doesn’t change frequently and is typically used as a starting point for your data transformation process. Seeds are often used for reference data, historical data snapshots, or other datasets that remain relatively stable over time.
  2. Version Control: The statement mentions that seeds are “version controlled.” This means that like other code and configuration files in your dbt project, seeds can be tracked and managed using version control systems like Git. This is important for tracking changes, collaborating with team members, and ensuring that historical versions of the seed data are accessible.
  3. Code Reviewable: Seeds are considered “code reviewable,” which means that changes to seed data can go through a code review process, just like changes to your SQL code or other project assets. Code reviews help maintain data quality, consistency, and best practices by allowing team members to review and provide feedback on changes to the seed data.

Now, let’s understand why seeds are best suited for static data that changes infrequently:

  • Stability: Since seeds are typically used for static or reference data, they are designed to represent a stable and consistent source of information. Changes to this data should be infrequent to maintain its reliability.
  • Version Control and Code Review: Seeds are valuable for data that is important to the integrity of your analytics or reporting. By version controlling and code reviewing changes to seed data, you can ensure that modifications are well-documented, reviewed for accuracy, and don’t inadvertently introduce errors into your analytics.
  • Efficiency: Frequent changes to seed data can introduce complexity and potentially slow down your data transformation processes. Using seeds for static data helps maintain efficiency in your dbt workflow.

Example

Add the file to your seeds directory, with a .csv file extension, e.g. seeds/city_zip_code.csv
seeds/city_zip_code.csv
city,zip_Code
Ellicott City,21043
Upper Marlboro,20774
Rockville,20852

Run the dbt seed command command — a new table will be created in your warehouse in your target schema, named city_zip_codes

$ dbt seed
Running with dbt=1.00.1

Refer to seeds in downstream models using the ref function.

models/myzip.sql
— This refers to the table created from seeds/city_zip_codes.csv

select * from {{ ref('city_zip_codes') }}

Seeds are configured in your dbt_project.yml

Points to note

Seeds should not be used to load raw data. As seeds are version controlled, they are best suited to files that contain business-specific logic, for example a list of country codes or user IDs of employees. Seeds are not intended for the importation of raw data. Given that seeds are subject to version control, they are ideally applied to files imbued with business-centric logic, such as those listing country codes or employee user IDs.

By default, dbt expects your seed files to be located in the seeds subdirectory of your project. To change this, update the seed-paths configuration in your dbt_project.yml

seed-paths: ["custom_seeds"]

If you want to rerun the command with a –full-refresh . The below command dbt truncates the existing table and reinserts the data

dbt seed --full-refresh

dbt will infer the datatype for each column based on the data in your CSV. If you need to explicitly set a datatype, then that can be achieved by using the column_types configuration in dbt_project.yml

seeds:
  freshers_in: # you must include the project name
    warehouse_locations:
      +column_types:
        article_title: varchar(500)

To build one seed at a time you can use 

$ dbt seed --select country_codes

If you want to exclude , you can use –exclude function as well 

Author: user

Leave a Reply