For Week 4 of DataTalksClub's data engineering zoomcamp, we had to install dbt and create a project. This turned out to be a formidable task. dbt is a data transformation tool that lets data analysts and engineers transform data in a cloud analytics warehouse, BigQuery in our case. It took me so long that I needed the homework extension.
Installing dbt
When I went to install dbt, I found I had an account from two years ago that was still connected to a GitHub repository on my account. Every day for two years it had tried to run a job and failed, leaving a long string of failures, each of which took just a few seconds. I tried to set up a new dbt account under another email. That one gave me a hard time too, because of a mix of deleting GitHub access to dbt and not following the instructions. I think I've got it now. I'm not going to go into the details of setting up a dbt environment and linking it to GitHub, because I didn't follow the straight and narrow path. I still have to stop the job that has been running for two years on my other account.
Preparing the data
So now let's get on with starting a project. To follow along with the video, I needed yellow and green taxi data tables in BigQuery for 2019 and 2020, plus a for-hire vehicle (FHV) table for 2019. I used a Python script to load the taxi data into a GCS bucket, and that went well for all of the data: I had parquet files in the bucket. Then I made tables in BigQuery. That worked for the green and yellow taxi data; for each set I created an external table first and then a materialized (regular) table. Unfortunately, the FHV data gave me data type errors I couldn't resolve. I tried loading the parquet files directly from the taxi website (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page), but still hit data type errors. Eventually I gave up and used the .csv files from the data-engineering-zoomcamp repository for the FHV data. I'm sorry to say that at this point I just downloaded them manually instead of using a Python notebook, Mage, or dlt as we've been learning throughout the course. I downloaded them to my computer, unzipped them with gunzip, and uploaded them into the bucket from the GCP console.
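The external-table-then-regular-table pattern looks roughly like this in BigQuery SQL. The project, dataset, and bucket names below are hypothetical placeholders, not the ones I actually used:

```sql
-- Sketch only: `my-project`, `trips_data_all`, and `my-bucket` are placeholders.
-- The external table reads the parquet files straight from the GCS bucket.
create or replace external table `my-project.trips_data_all.ext_yellow_tripdata`
options (
    format = 'PARQUET',
    uris = [
        'gs://my-bucket/yellow/yellow_tripdata_2019-*.parquet',
        'gs://my-bucket/yellow/yellow_tripdata_2020-*.parquet'
    ]
);

-- Then materialize a regular table from the external one, so BigQuery
-- holds the data natively and can estimate query costs.
create or replace table `my-project.trips_data_all.yellow_tripdata` as
select * from `my-project.trips_data_all.ext_yellow_tripdata`;
```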
Connecting to GitHub
I had already connected dbt to my GitHub repo containing all the files from the data engineering zoomcamp, so in the dbt console (the IDE, or integrated development environment) the directory structure was already available to me. I had some problems following the video: I got errors because I had copied the finished model file before importing any macros or dbt_utils. But in the end, I was able to build a project and run it in development mode.
A model is a .sql file that dbt compiles into a query. It starts with a line of Jinja, a templating engine for Python. Jinja expressions are enclosed in double curly braces, and Jinja statements in curly braces with percent signs. Here, we're telling dbt how to materialize the relation it's going to create for this model:

```sql
{{ config(materialized="view") }}
```
This means that for this particular model, dbt will create a view in our BigQuery database, in the schema we defined in the dbt_project.yml file.
In the rest of the model, we tell dbt to use a particular source from the staging directory, in this case "yellow_tripdata". We use various Jinja template functions to cast the columns, rename some of them, and apply a few macros. For example, we rename tpep_pickup_datetime to pickup_datetime, because the green taxi data calls the same column lpep_pickup_datetime and we're going to combine the two sets into one table. The model also deduplicates the data with row_number(), keeping only the first row for each (vendorid, pickup_datetime) pair. Finally, we use a macro, get_payment_type_description, to describe the payment type, and a variable to limit the size of the model during development iterations.
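For reference, the macro lives in the project's macros directory. A minimal sketch of how it can be defined, with the code descriptions taken from the NYC TLC data dictionary:

```sql
{# macros/get_payment_type_description.sql -- a sketch, not necessarily
   the exact version shipped with the zoomcamp project #}
{% macro get_payment_type_description(payment_type) -%}
    case {{ dbt.safe_cast(payment_type, api.Column.translate_type("integer")) }}
        when 1 then 'Credit card'
        when 2 then 'Cash'
        when 3 then 'No charge'
        when 4 then 'Dispute'
        when 5 then 'Unknown'
        when 6 then 'Voided trip'
    end
{%- endmacro %}
```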
```sql
with
    tripdata as (
        select
            *,
            row_number() over (partition by vendorid, tpep_pickup_datetime) as rn
        from {{ source("staging", "yellow_tripdata") }}
        where vendorid is not null
    )
select
    -- identifiers
    {{ dbt_utils.generate_surrogate_key(["vendorid", "tpep_pickup_datetime"]) }}
        as tripid,
    {{ dbt.safe_cast("vendorid", api.Column.translate_type("integer")) }} as vendorid,
    {{ dbt.safe_cast("ratecodeid", api.Column.translate_type("integer")) }} as ratecodeid,
    {{ dbt.safe_cast("pulocationid", api.Column.translate_type("integer")) }} as pickup_locationid,
    {{ dbt.safe_cast("dolocationid", api.Column.translate_type("integer")) }} as dropoff_locationid,

    -- timestamps
    cast(tpep_pickup_datetime as timestamp) as pickup_datetime,
    cast(tpep_dropoff_datetime as timestamp) as dropoff_datetime,

    -- trip info
    store_and_fwd_flag,
    {{ dbt.safe_cast("passenger_count", api.Column.translate_type("integer")) }} as passenger_count,
    cast(trip_distance as numeric) as trip_distance,
    -- yellow cabs are always street-hail
    1 as trip_type,

    -- payment info
    cast(fare_amount as numeric) as fare_amount,
    cast(extra as numeric) as extra,
    cast(mta_tax as numeric) as mta_tax,
    cast(tip_amount as numeric) as tip_amount,
    cast(tolls_amount as numeric) as tolls_amount,
    cast(improvement_surcharge as numeric) as improvement_surcharge,
    cast(total_amount as numeric) as total_amount,
    coalesce(
        {{ dbt.safe_cast("payment_type", api.Column.translate_type("integer")) }}, 0
    ) as payment_type,
    {{ get_payment_type_description("payment_type") }} as payment_type_description
from tripdata
where rn = 1

-- dbt build --select <model.sql> --vars '{"is_test_run": false}'
{% if var("is_test_run", default=true) %} limit 100 {% endif %}
```
With this file, which we run with the dbt build command, we've transformed the raw data into something we can work with. The next step is to create the fact model, the model we present to business intelligence tools and stakeholders. For this we create another .sql file, named fact_trips.sql:
```sql
{{ config(materialized="table") }}

with
    green_tripdata as (
        select *, 'Green' as service_type from {{ ref("stg_green_tripdata") }}
    ),
    yellow_tripdata as (
        select *, 'Yellow' as service_type from {{ ref("stg_yellow_tripdata") }}
    ),
    trips_unioned as (
        select * from green_tripdata
        union all
        select * from yellow_tripdata
    ),
    dim_zones as (
        select * from {{ ref("dim_zones") }} where borough != 'Unknown'
    )
select
    trips_unioned.tripid,
    trips_unioned.vendorid,
    trips_unioned.service_type,
    trips_unioned.ratecodeid,
    trips_unioned.pickup_locationid,
    pickup_zone.borough as pickup_borough,
    pickup_zone.zone as pickup_zone,
    trips_unioned.dropoff_locationid,
    dropoff_zone.borough as dropoff_borough,
    dropoff_zone.zone as dropoff_zone,
    trips_unioned.pickup_datetime,
    trips_unioned.dropoff_datetime,
    trips_unioned.store_and_fwd_flag,
    trips_unioned.passenger_count,
    trips_unioned.trip_distance,
    trips_unioned.trip_type,
    trips_unioned.fare_amount,
    trips_unioned.extra,
    trips_unioned.mta_tax,
    trips_unioned.tip_amount,
    trips_unioned.tolls_amount,
    trips_unioned.improvement_surcharge,
    trips_unioned.total_amount,
    trips_unioned.payment_type,
    trips_unioned.payment_type_description
from trips_unioned
inner join dim_zones as pickup_zone
    on trips_unioned.pickup_locationid = pickup_zone.locationid
inner join dim_zones as dropoff_zone
    on trips_unioned.dropoff_locationid = dropoff_zone.locationid
```
Here we've unioned the yellow and green tables together, and also joined in dim_zones, a model built from a seed file, using the Jinja ref function. This lineage picture from dbt helps show what's going on:
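dim_zones itself is a simple model over the seed. Assuming the seed file is named taxi_zone_lookup, it can look something like this:

```sql
-- models/core/dim_zones.sql -- a sketch; assumes a seed named taxi_zone_lookup
{{ config(materialized="table") }}

select
    locationid,
    borough,
    zone,
    replace(service_zone, 'Boro', 'Green') as service_zone
from {{ ref("taxi_zone_lookup") }}
```

Because fact_trips inner-joins dim_zones on both pickup and dropoff location, trips with an 'Unknown' borough on either end are filtered out of the fact table.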
So now we have the table fact_trips, which we'll use to create visualizations for the business stakeholders. We can also create data marts to surface aspects of the data that interest particular stakeholders. That's the next step of the process: creating the dimensional models used to present the data to them.