Aram Panasenco
Test-Driven Development For Analytics Engineering

As long as end users trust their own queries against raw data more than they trust the analytics engineering team and its data models, nothing the analytics engineering team does matters. While so-called 'best practices' are almost never applicable to every kind of organization and every situation, I do believe that every analytics engineering team can benefit from adopting test-driven development. The most important thing about test-driven development is not just that it improves data quality and the perception of data quality (though it does both), but that it gives analytics engineers trust and confidence in themselves.

What is test-driven development?

Test-driven development (TDD), as the name implies, is about making tests drive the development process. The steps of test-driven development are:

  1. Gather concrete requirements.
  2. Turn a requirement into a test.
  3. Run the test - it should fail.
  4. Implement the change.
  5. Run all tests - they should all pass now.
  6. Repeat from step 1.

Following this simple process will have huge effects on the quality of your data models and your relationships with data customers.

The meaning of trust

In his 2006 book The Speed of Trust, Stephen M. R. Covey defines trust as consisting of four components:

  • Integrity
  • Intent
  • Capabilities
  • Results

Covey also writes that being trusted by others has to start by trusting yourself. Do you as an analytics engineer have confidence in your own integrity, intent, capabilities, and results? That confidence is the prerequisite to being trusted by your data customers.

How TDD enhances your confidence in your own integrity

In my experience, analytics engineers are quick to promise "I'll fix it so it doesn't happen again," but are hesitant to promise "I'll catch it first if it happens again." Subconsciously, they betray a lack of confidence in their own integrity. After all, how can you be sure you've fixed an issue if you don't even know whether it's happening?

TDD allows you to give a factual statement like "I've written a test that reproduces the issue you're experiencing" instead of giving promises about things potentially outside of your control. Depending on the maturity of your automated testing and alerting framework, you may be able to say even more. For example: "Once deployed, this test will run daily and will alert us if this issue reoccurs."

Data issues tend to spontaneously "un-fix" themselves all the time, and you don't necessarily have control over that. But you do have control over your development process. Writing tests first will enable you to communicate what you're doing instead of burying yourself deeper and deeper in promises. This will grow confidence in your integrity, from yourself as well as from others.

How TDD enhances your confidence in your own intent, capability, and results

Put yourself in the shoes of a data customer. You've carefully prepared examples of the kind of output you need and sent them to the analytics engineer. The engineer comes back with a finished data model. While validating the results, you find that one of the very examples you've given them isn't even correct in the output! Would you believe that analytics engineer really cares about helping you? That they have solid abilities? That they can drive results? And what effect would all this have on that engineer's opinion of themselves?

Most of the time, this kind of experience is caused not by a lack of care or ability, but by a regression. Regressions happen even to the most caring and capable engineers. Here's how: The analytics engineer works on the examples one at a time. However, the logic changes they make to satisfy the second example can inadvertently break the first example. The problem compounds as new edge cases are introduced. Working on the tenth example can break any one of the previous nine. Without automated testing, these regressions can be almost impossible to catch.

Over the course of the last major analytics engineering project I worked on, the tests I wrote caught three regressions I'd accidentally introduced. Without those tests, I could well have been thought of (and thought of myself) as a sub-par engineer who failed to get even the example rows right three times in a single project. Instead, I have complete confidence that all the examples are satisfied, and that I can take on additional complexity without introducing regressions. This is a matter of discipline, not of intelligence or ability.

How to do test-driven development

Start with the data the customer actually needs

It all starts by talking to the data consumer and understanding their needs. If something broke, what's an example? Turn that example into your test. If something new is needed, what does it look like? Ask them to mock up a few example cases in a spreadsheet-like format. The rows of that spreadsheet become your tests.

Don't worry about source data, facts, or dimensions in your tests. Focus on the highest level of what the customer needs. Ask them and they'll be able to represent their needs in a spreadsheet-like format every time.
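
For instance, a customer mock-up might look something like the table below. The column names are hypothetical, chosen to line up with the test example later in this post, and the values are made up.

| id1 | id2 | value1 | value2 | value3 |
| --- | --- | --- | --- | --- |
| some_id | other_id | some example value | some other value | 0.00 |
| third_id | other_id | a different value | yet another value | 2.50 |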

Think about the columns of that spreadsheet. One or more of them can serve as an identifier for a particular example row, and there should be only one row with that identifier. The values in some of the remaining columns represent the business logic of the example. Therefore, we need to test two things: does that example row exist, and do the business logic values match?

A dbt data test returns failing records. In other words, the test has succeeded when no rows are returned. Here's an example implementation:

with row_count as (
    select
        count(*) as n
    from {{ ref("denormalized_view") }}
    where id1 = 'some_id'
        and id2 = 'other_id'
)
-- Fail if the example row identified above is missing (or duplicated)
select 'Not exactly 1 row' as error_msg
from row_count
where n <> 1
union all
-- Fail if the example row exists but any business logic value doesn't match
select 'Row test failed' as error_msg
from {{ ref("denormalized_view") }}
where id1 = 'some_id'
    and id2 = 'other_id'
    and not (
        value1 = 'some example value'
        and value2 is not null and value2 = 'some other value'
        and abs(value3) <= 0.01
    )

Let's dissect what's happening in this query. There are two select statements combined with a union all. The first returns a failing record if the row identified by the identifier(s) doesn't exist in the data. This is important so we don't inadvertently pass a test when the data isn't there in the first place. The second identifies that same row and then looks for any discrepancies in the business logic values. That's easiest to achieve by wrapping the expected values in a not().

Do watch out for null values. Due to SQL's three-valued logic, the filter not(column = 'value') will not return rows where the column is null. I recommend testing for nulls separately using dbt's generic not_null test so you don't have to remember to handle them in every test.
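
Here's a minimal standalone sketch of the trap; the column name is just a placeholder, and your warehouse may spell the varchar type differently:

-- The single row has a null value2, yet the query returns zero rows:
-- null = 'expected value' evaluates to unknown, and not(unknown) is still
-- unknown, so the row is filtered out and the discrepancy goes unreported.
with example as (
    select cast(null as varchar) as value2
)
select 'Row test failed' as error_msg
from example
where not (value2 = 'expected value')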

This kind of row-level test is very easy to copy, paste, and adapt. It's also easy to read and maintain. It will be all you need 90% of the time.

It's easy to accidentally write a SQL query that produces no rows. That's why it's also easy to write a dbt data test that accidentally passes. The test should be written and run first, before any development work is done. The test should fail. Then the change should be implemented, and the test should succeed.
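
As a hypothetical illustration, here's a test with both problems: it skips the row-count check, and its identifier contains a typo, so it never matches any row:

-- Hypothetical anti-pattern: there's no row-count check, and 'some_idd'
-- (typo) matches no rows, so the query returns nothing and dbt reports the
-- test as passing even though it verified nothing. Running the test before
-- implementing the change exposes this: a test that passes against
-- unchanged data is suspect.
select 'Row test failed' as error_msg
from {{ ref("denormalized_view") }}
where id1 = 'some_idd'
    and not (value1 = 'some example value')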

Enforcing test-driven development

It's a good idea to work towards enforcing test-driven development in your analytics engineering team. Rather than surprising folks with a new policy, I recommend setting a deadline by which test-driven development will be mandated, and ensuring the team gets familiar with it before the deadline.

Here's an example workflow that incorporates test-driven development:

  • All dbt models are stored in a Git repo with a write-protected production branch. All changes to production have to come through pull requests with at least one peer approval.
  • Analytics engineers create feature branches off the production branch and open pull requests when the features are ready.
  • Every peer reviewer is expected to only approve the pull request if they see tests corresponding to every feature. If they don't see corresponding tests, that means TDD wasn't followed and the pull request shouldn't be approved.

What about data engineering?

It's tempting and somewhat justified to make data engineers follow TDD as well. However, the value proposition of TDD for data engineering is not as clear as for analytics engineering.

Since the predominant data warehousing paradigm shifted from ETL (extract-transform-load) to ELT (extract-load-transform), the role of data engineers also changed. Data engineers are now focused on querying APIs and then loading the responses into the data warehouse in the rawest possible form.

Analytics engineers work inside the data warehouse, which is a deterministic environment. The same inputs always produce the same outputs, and the logic used to create the outputs is complex. That's a perfect environment for TDD to be impactful.

Data engineers work in an almost opposite environment. Since they just extract and load, there's no logic at all. At the same time, they have to interface with external systems, which can have a whole host of unpredictable issues.

It's definitely possible to do test-driven development as a data engineer, but it's difficult and produces questionable benefits. Suppose you're loading data from the Facebook Ads API. How do you do that in a test-driven way? You could use requests-mock to simulate possible inputs with corresponding outputs and errors. However, the only thing you do with the output is load it into the data warehouse as directly as possible, so there's not much to test there. Additionally, you may not know what the possible errors are, and even if you do, there's nothing you can do about them from your end except retry.

For these reasons, I don't attempt to follow test-driven development when writing extract-and-load processes, and instead focus on architectural simplicity and robust retries using libraries like backoff and tenacity.

Conclusion

If you're an analytics engineer, I hope this post has convinced you to give test-driven development a try. If you're an analytics engineering team leader, I hope you consider making test-driven development a requirement for your team.

Analytics engineering is uniquely well-suited to test-driven development. The effort of creating tests from end-user requirements is low, and the cost of regressions from complex, untested business logic in your data models is high. On top of that, test-driven development boosts trust in data throughout your organization, and makes working with data more pleasant and fun for everyone.

Cover image generated using FLUX.1-schnell on HuggingFace
