dbt (data build tool) has emerged as the industry standard for data transformations in recent years. It combines the accessibility of SQL with software engineering best practices, allowing data teams to build dependable data pipelines and to document, test, and version-control them. dbt does all of this by taking over the transformation step in your ETL pipelines.
While dbt is ideal for batch data transformations, it can only approximate transformations over streaming data. Dealing with real-time data can be quite challenging, especially at high volumes.
But streaming dbt is now within reach: Materialize, a state-of-the-art SQL platform for processing streaming data, has announced a new dbt adapter. Keep reading to find out why this could be a game-changer in the streaming analytics world.
What Is Materialize's dbt Adapter?
The Materialize dbt adapter is an integration that lets you transform real-time streaming data using Materialize as your data warehouse.
Together, these tools can allow data analysts to be the creators and users of streaming pipelines instead of relying on data engineers. The result could be a more streamlined analytics workflow as streaming capabilities become accessible across several data ecosystems.
The Problem With Today's Real-Time Analytics
To begin, what exactly do we mean by batch and streaming data? As the name implies, batch data is any data that arrives in discrete batches, which can be once a minute, once an hour, or once a day. On the other hand, streaming data comes continuously and on no particular schedule.
Let's see what challenges this can cause.
Batch-based tooling for real-time data is complicated for higher data volumes.
Batch-based methods are adequate for the majority of production use cases. And while true real-time needs are rare, the stakes are generally higher when they do arise. Unfortunately, we do not have many options for meeting those needs right now.
Particularly at larger data volumes, there is a limit to how much we can optimize SQL before performance suffers. So, as data quantities increase, we need streaming-specific tooling.
Current dbt methods do not truly transform streaming data in real-time.
Let's look at how dbt transforms data under the hood to see why it doesn't currently transform streaming data in real-time.
dbt users define the transformations they want as dbt "models." A dbt model provides two things: a SELECT statement carrying out the desired transformation and a materialization parameter.
dbt supports four materializations: table, view, incremental, and ephemeral. A table model creates a table, a view model creates a view, an incremental model creates a table that dbt updates in increments, and an ephemeral model is inlined into dependent models as a common table expression (CTE) without persisting anything.
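To make this concrete, here is a minimal sketch of what a dbt model looks like; the model and table names (page_views, raw_events) are hypothetical:

-- models/page_views.sql (hypothetical model)
-- The config block sets the materialization; the SELECT defines the transformation.
{{ config(materialized='view') }}

select
    user_id,
    count(*) as page_view_count
from {{ ref('raw_events') }}
group by user_id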
These database objects are sufficient for batch data transformations, but they are not efficient at transforming streaming data.
Real-Time Streaming Models Using dbt + Materialize
To perform reliable, real-time transformations of streaming data, dbt would need to persist a database object that updates itself as new data arrives upstream. Fortunately, such a database object exists: the materialized view.
Materialize's materialized views, unlike typical materialized views, are updated continuously as new data arrives; no refreshes are required. Even better, they deliver real-time results with millisecond latency.
So, what does this have to do with dbt and streaming data? It means that when you execute a dbt model on top of Materialize, dbt persists it as a materialized view. You'll never have to re-run your model again: it will remain up to date regardless of how much, or how frequently, your data arrives.
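In practice, a model running on Materialize looks like any other dbt model; only the materialization parameter changes. A minimal sketch, assuming a hypothetical events staging model (note that the exact materialization name, materializedview here, may vary between adapter versions):

-- models/user_edit_counts.sql (hypothetical model)
{{ config(materialized='materializedview') }}

select
    user_id,
    count(*) as edit_count
from {{ ref('events') }}
group by user_id

Because Materialize maintains the view incrementally, this model never needs to be re-run after the initial dbt run.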
Use dbt to Create Materialized Views for Streaming
Unlike querying tables or logical views, querying materialized views can minimize query costs, because the results are stored and updated only when necessary. Creating and maintaining a materialized view can therefore save money on expensive or frequently run queries. However, the potential cost savings depend heavily on the view's underlying refresh mechanism: only incremental refreshes can lower the per-refresh cost of maintaining a materialized view while still ensuring that it is up to date when needed.
A key difference between Materialize and traditional data warehouses is that Materialize's materialized views behave like constantly updated queries rather than cached results.
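To illustrate the difference, here is a rough sketch using hypothetical table names. In a traditional warehouse such as PostgreSQL, a materialized view is a cache that must be recomputed explicitly:

-- Traditional warehouse: the view is a cache that goes stale.
CREATE MATERIALIZED VIEW daily_orders AS
    SELECT order_date, count(*) AS order_count
    FROM orders
    GROUP BY order_date;

-- Each refresh recomputes the results from scratch.
REFRESH MATERIALIZED VIEW daily_orders;

In Materialize, the same CREATE MATERIALIZED VIEW statement yields a view that is maintained incrementally as new rows arrive, so there is no REFRESH step at all.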
So, if you're a dbt user who is accustomed to batch procedures, you'll be delighted to know that with Materialize's dbt adapter you execute "dbt run" once, and your data stays up to date. Later in this article, we'll walk through a use case to show how that works.
Optimize Your Workflow With Materialize + dbt
The Materialize + dbt integration allows data engineers and analysts to collaborate across numerous data warehouses for a far more streamlined approach to data transformations.
Connecting your GitHub account to dbt Cloud, for example, unlocks additional capabilities. Once your GitHub account is linked, you can do the following:
- Trigger CI builds when pull requests are opened in GitHub.
- Log in to dbt Cloud using GitHub OAuth.
- Add new repositories to dbt Cloud with a single click.
Use Macros to Parameterize and Deploy Views
Parameterizing and deploying views through macros is a fantastic technique for scaling Materialize pipelines. In dbt, SQL can be combined with Jinja, a templating language.
Using Jinja turns your dbt project into a SQL programming environment, allowing you to do things that aren't ordinarily feasible in SQL alone (see the sketch after this list). For instance, with Jinja, you can:
- Use control structures (such as if statements and for loops).
- For production deployments, use environment variables.
- Modify the way your project is built depending on the current target.
- Use the output of one query to generate another.
- Use macros.
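As an example of the first point, a Jinja for loop can generate repetitive SQL for you. A minimal sketch, where the payments model and the payment methods are hypothetical:

-- models/order_payments.sql (hypothetical model)
-- The loop emits one aggregated column per payment method.
select
    order_id,
    {% for method in ['bank_transfer', 'credit_card', 'gift_card'] %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end)
        as {{ method }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('payments') }}
group by order_id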
Macros are reusable chunks of Jinja code, much like functions in other programming languages. They are convenient if you find yourself repeating code across multiple models.
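For instance, a macro can wrap an expression you would otherwise copy into every model. A minimal sketch with a hypothetical cents_to_dollars macro:

-- macros/cents_to_dollars.sql (hypothetical macro)
{% macro cents_to_dollars(column_name, precision=2) %}
    ({{ column_name }} / 100)::numeric(16, {{ precision }})
{% endmacro %}

-- Any model can then call the macro like a function:
select
    id,
    {{ cents_to_dollars('amount_cents') }} as amount_dollars
from {{ ref('payments') }}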
Demo: Using dbt + Materialize to Stream Wikipedia Data
This demo project illustrates how to turn streaming Wikipedia data into materialized views in Materialize using dbt. Refer to this guide to get everything you need to run dbt with Materialize.
To start, let's set up a stream of Wikipedia's recent changes and write all the data we see to a file.
_Note: If you are running Materialize via a Docker container, run docker exec -it [materialized container id] /bin/sh before curl-ing to create this file directly within the Materialize container._
From a new shell, run:
while true; do
  curl --max-time 9999999 -N https://stream.wikimedia.org/v2/stream/recentchange >> wikirecent
done
Note the absolute path of the location of wikirecent, which we'll need in the next step.
Connect to your Materialize instance from your shell:
psql -U materialize -h localhost -p 6875 materialize
Then, create a source using your wikirecent file:
CREATE SCHEMA wikimedia;
CREATE SOURCE wikimedia.wikirecent
FROM FILE '[path to wikirecent]' WITH (tail = true)
FORMAT REGEX '^data: (?P<data>.*)';
This source takes the lines from the stream, finds those that begin with data:, and captures the rest of the line in a column called data.
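To sanity-check the source from your psql session, you can list its columns; expect to see the captured data column alongside Materialize's metadata columns:

SHOW COLUMNS FROM wikimedia.wikirecent;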
Now we can use dbt to create materialized views on top of wikirecent. In your shell, navigate to play/wikirecent-dbt within the clone of the repo on your local machine. Once there, run the following dbt command:
dbt run
Note: If the profiles.yml you're using for this project is not located at ~/.dbt/, you will have to point dbt at it, for example with dbt run --profiles-dir [path to profiles directory].
This command generates executable SQL from our model files (found in the models directory of this project) and executes that SQL against the target database, creating our materialized views.
Note: If you installed dbt-materialize in a virtual environment, ensure it's activated.
That's it! You just used dbt to create materialized views in Materialize. You can verify the views were created from your psql shell connected to Materialize.
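A minimal sketch of such a check; replace user_edit_counts with whichever model names your project's models directory actually defines:

-- List the views dbt just created.
SHOW VIEWS;

-- Peek at the live results of one model (hypothetical name).
SELECT * FROM user_edit_counts LIMIT 5;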