In this video we learned how to test and document dbt models. We also learned about the codegen library. This is part of Week 4 of the data engineering zoomcamp by DataTalksClub.
We want to catch errors quickly, and we don't want to build on top of data that is full of errors. Tests are assumptions we make about the data. dbt compiles these assumptions to SQL queries that return the failing records; a test passes when the query returns zero rows. Tests are defined on table columns in a .yml file.
dbt ships with four generic tests: unique, not_null, accepted_values, and relationships (values that must exist as a foreign key in another table). We can also write custom tests as SQL queries.
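As a minimal sketch of a custom test: a "singular" test in dbt is just a SQL file in the tests/ directory that selects the rows that should not exist (the test fails if any rows come back). The column checked here, fare_amount, is illustrative and not part of this walkthrough:

```sql
-- tests/assert_fare_amount_non_negative.sql
-- A singular test: dbt runs this query and fails the test
-- if it returns any rows. Here, any trip with a negative
-- fare_amount counts as a failing record.
select
    tripid,
    fare_amount
from {{ ref('stg_green_tripdata') }}
where fare_amount < 0
```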
To write these .yml entries quickly, I installed a dbt package called "codegen". This package generates code that I could copy-paste into files; I used it to generate the column entries for a YAML file. First I installed codegen by adding it to packages.yml (under the top-level packages: key):
packages:
  - package: dbt-labs/codegen
    version: 0.12.1
Then I ran dbt deps in the command line area. Next, I opened a scratch file (+ on the upper right of the tabs) and pasted the following:
{% set models_to_generate = codegen.get_models(directory='staging', prefix='stg') %}
{{ codegen.generate_model_yaml(
    model_names = models_to_generate
) }}
Then I hit "Compile selection". This generated the column names for all the models whose names start with stg, along with their types and a blank "description" field. I copied all of this into schema.yml under the "models" key. Here's a partial view:
version: 2

sources:
  - name: staging
    database: data-engineering-2024-411821
    # For postgres:
    # database: production
    schema: trips_data_all
    # loaded_at_field: record_loaded_at
    tables:
      - name: green_tripdata
      - name: yellow_tripdata
      - name: fhv_tripdata
models:
  - name: stg_green_tripdata
    description: ""
    columns:
      - name: tripid
        data_type: string
        description: ""
        tests:
          - unique:
              severity: warn
          - not_null:
              severity: warn
      - name: vendorid
        data_type: int64
        description: ""
      - name: ratecodeid
        data_type: int64
        description: ""
You can see where I added tests for tripid. I wanted tripid to be unique and not null; these are two of the generic tests provided by dbt. The severity is warn because, for now, I don't actually want a failure to stop the build.
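Under the hood, dbt compiles each generic test to a query that selects the failing records. The unique test compiles to something roughly like this (a simplified sketch, not dbt's exact generated SQL):

```sql
-- Failing records for the unique test: any tripid that
-- appears more than once in the model.
select
    tripid
from stg_green_tripdata
where tripid is not null
group by tripid
having count(*) > 1
```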
Another test is relationships. For this, I needed to identify the field and table that are expected to contain the values appearing in the column.
      - name: pickup_locationid
        data_type: int64
        description: ""
        tests:
          - relationships:
              field: locationid
              to: ref('taxi_zone_lookup')
              severity: warn
      - name: dropoff_locationid
        data_type: int64
        description: ""
        tests:
          - relationships:
              field: locationid
              to: ref('taxi_zone_lookup')
              severity: warn
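The relationships test checks referential integrity: every pickup_locationid must exist as a locationid in taxi_zone_lookup. Conceptually, it compiles to a query like this (a simplified sketch of the generated SQL):

```sql
-- Failing records: pickup_locationid values that have
-- no matching locationid in taxi_zone_lookup.
select
    child.pickup_locationid
from stg_green_tripdata as child
left join taxi_zone_lookup as parent
    on child.pickup_locationid = parent.locationid
where child.pickup_locationid is not null
  and parent.locationid is null
```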
Finally, I added an accepted-values test to payment_type. To hold the list of accepted values, I defined the variable "payment_type_values" in dbt_project.yml:
vars:
  payment_type_values: [1, 2, 3, 4, 5]
I know that there are actually six payment types, and I wanted to test my tests, so I deliberately left out "6". Back in schema.yml, I added the accepted_values test to payment_type, using the "var" macro to insert the list. The quote: false setting is needed for BigQuery: by default, accepted_values wraps the values in quotes in the compiled SQL, and comparing an INT64 column against quoted strings would make the query error out.
      - name: payment_type
        data_type: int64
        description: ""
        tests:
          - accepted_values:
              values: "{{ var('payment_type_values') }}"
              severity: warn
              quote: false
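Conceptually, the compiled test selects any payment_type outside the accepted list (a simplified sketch; with the default quote: true, the values would appear as '1', '2', … instead of bare integers):

```sql
-- Failing records: any payment_type not in the accepted list.
select
    payment_type
from stg_green_tripdata
where payment_type not in (1, 2, 3, 4, 5)
```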
I ran the tests with dbt test, and I got a warning that some payment_type values aren't among the expected ones. This is exactly what I wanted to see: the deliberately incomplete list triggered the test. With tests like these, we can check data consisting of millions of rows and make sure there's nothing wrong with it.