Dbt is a tool for transforming data in data warehouses using Python and SQL. The idea is that Data Engineers make source data available to an environment where dbt projects run (for instance with Debezium or with Airflow), and Data Analysts can then run their dbt projects against this data to produce models (tables and views) that can be used with BI tools.
This layer allows the decoupling of the models on which reports and dashboards rely from the source data, and if our business rules or our source systems change we can still maintain the same models as a stable interface.
Some of the things that dbt can do include:
- import reference data from csv files
- track changes in source data with different strategies so that downstream models do not need to be built every time from scratch
- run tests on data, to confirm assumptions remain valid, and to validate any changes made to the models' logic
Due to its unique capabilities, CrateDB is an excellent warehouse choice for data transformation projects. It offers automatic indexing, fast aggregations, easy partitioning, and the ability to scale horizontally. In this article, I will illustrate how to get the most important functionalities of dbt working by doing the necessary changes in the configuration.
Our starting point will be a fresh install of dbt-postgres
:
pip install dbt-postgres==1.6.0
We can then create a profiles file with our connection details:
cd ~
mkdir .dbt
cat << EOF > .dbt/profiles.yml
example_datawarehouse_profile:
target: dev
outputs:
dev:
type: postgres
host: localhost
port: 5432
database: crate
schema: doc
search_path: doc
user: dbt
password: pwd1234567A
EOF
(please note the values for database
, schema
, and search_path
in this example)
We will not go into the details of how the project files are structured (for more information check out dbt’s documentation), but in general, a dbt project consists of a combination of SQL, Jinja, YAML, and markdown files. In our project folder, alongside the models
folder that most projects have, we can also create a folder called macros
where we can place macro overrides.
Let's then create a macros
folder and place some files with overrides on it:
mkdir macros
cd macros
wget https://community.crate.io/uploads/short-url/fKupQCFUHtuoKom3jAfKrldUXkt.sql
wget https://community.crate.io/uploads/short-url/qvQExEq1OopiVUcXACLGfpdGHYF.sql
wget https://community.crate.io/uploads/short-url/3jcFxL1EExLrERJSTc6ScnzTS9f.sql
cd ..
A few things I have tested with these overrides:
- models with view, table, and ephemeral materializations
- dbt source freshness
- dbt test
- dbt seed
-
Incremental materializations (with
incremental_strategy='delete+insert'
and without involving OBJECT columns)
I hope you find this useful. CrateDB is continuously adding new features and I will endeavor to come back and update this article if there are any developments and some of these overrides require changes or become obsolete.
Top comments (0)