Many data professionals, myself included, have had to rethink the way we work in the aftermath of the 2022-2023 interest rate spike. The new industry-wide reality of smaller teams, higher pressure, and higher turnover forces a renewed focus on simplicity. A simple data architecture is a great starting point for all organizations. Saying that something is a "best practice" is no longer enough to justify additional processes and tools. Complexity should only be introduced if absolutely necessary to meet business needs.
This post is a comprehensive collection of "simplest practices" that can be used to build a data warehouse from the ground up, organized into the sections that follow.
The diagram
Diagram inspired by the Krazam video High Agency Individual Contributor. 😁 Made with Inkscape using clipart from Vecteezy.
Defining simplicity
There are many measures of the quality of a data architecture: satisfying requirements, correctness, cost effectiveness, compliance, openness, and many more. Simplicity is only one of these measures. But what exactly are we measuring when we talk about simplicity?
I define simplicity as a tradeoff between value, effort, and moving parts:
- Maximize the value of data products delivered to stakeholders.
- Minimize the necessary effort to document, learn, contribute to, and maintain the data architecture.
- Minimize the number of moving parts, such as technologies, processes, data structures, files, and lines of code.
Use Snowflake as your data warehouse
As much as I love the idea of just using PostgreSQL, I know that I'd spend countless hours troubleshooting sharding, checkpointing, bloat, vacuuming, and more performance issues. Managing the performance of a data warehouse database can easily become a full-time job (and in many organizations, does).
Snowflake eliminates this need with virtual warehouses that can be scaled up or out, and with support for dozens of concurrent queries on the same table through micro-partitioning. Add to that a user-friendly UI and a growing list of features focused on modern data warehousing, and it becomes really hard to beat Snowflake in terms of delivering high value for minimal effort.
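To make "scale up or out" concrete, here's roughly what it looks like in Snowflake SQL; the warehouse name is made up, and multi-cluster scaling assumes the Enterprise edition:

-- Scale up: a bigger warehouse size for heavier queries
alter warehouse transforming_wh set warehouse_size = 'LARGE';

-- Scale out: more clusters for higher concurrency (Enterprise edition and above)
alter warehouse transforming_wh set min_cluster_count = 1 max_cluster_count = 4;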
(I still want to set up an open-source data warehouse myself at least once, though.)
Manage infrastructure through Terraform/OpenTofu
Infrastructure is a great example of the tradeoff between minimizing effort and minimizing the number of moving parts. If you're standing up a quick single-user Postgres database to use dbt locally, Terraform is definitely overkill. A small startup of fewer than 10 people probably doesn't need Terraform either. However, the balance tips toward infrastructure-as-code by the time an organization reaches around 100 people. Once your organization starts having separate departments, suddenly you need multiple databases, multiple schemas, auditable access controls, review environments, scalable compute, and potentially even cloud integrations. It's definitely possible to do all this manually, but the effort to document and enforce the standards quickly begins to outweigh the effort to stand up, learn, and maintain an infrastructure-as-code solution.
After Terraform changed its license in 2023, a truly open-source fork called OpenTofu was created. I'll keep using the term "Terraform" below to prevent confusion, but I do recommend OpenTofu over Terraform in your implementation.
If your organization uses Snowflake, you can use Terraform/OpenTofu to define your databases, schemas, roles, warehouses, and permissions. You can additionally use Terraform to create personal environments for each developer, as well as create review environments for each Git pull request.
Simplest practice: Minimize the number of data access roles
I once had the ambition to implement perfect role-based access in Snowflake. For each functional role that needs to access the data warehouse, there'd be a corresponding Snowflake role. That way, permissions could be fine-tuned for each role.
In practice, this utopian vision ended up as a huge Terraform file with the same access grants copied and pasted over and over and over. Terraform updates became super slow because each role-schema pair is an object Terraform has to track and manage, and the number of these combinations easily ran into the thousands. Not to mention all the constant requests for data access from different groups...
My current thinking is that you should start with just two roles: "Reporter" and "Developer". Reporters can only see production data (which could include most raw data depending on your organization's culture). Developers can additionally see and create non-production data. Start there and only add roles as absolutely necessary.
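As a sketch, with a recent version of the Snowflake Terraform provider, the starting point can be as small as the following (resource names vary between provider versions, so treat this as an assumption to verify against your provider's docs):

resource "snowflake_account_role" "reporter" {
  name = "REPORTER"
}

resource "snowflake_account_role" "developer" {
  name = "DEVELOPER"
}

# Developers can see everything reporters can see
resource "snowflake_grant_account_role" "developer_includes_reporter" {
  role_name        = snowflake_account_role.reporter.name
  parent_role_name = snowflake_account_role.developer.name
}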
Simplest practice: Create personal development schemas in Terraform
If you maintain the list of users that have the Developer role inside of Terraform, you can simply iterate over it to create the corresponding personal development schemas they can do their development work in. For example, in Snowflake:
resource "snowflake_schema" "personal_dev_schemas" {
for_each = toset(local.developers)
database = snowflake_database.database.name
name = "DEV_${each.key}"
}
resource "snowflake_grant_privileges_to_account_role" "personal_dev_schema_grants" {
for_each = snowflake_schema.personal_dev_schemas
privileges = ["USAGE", "MONITOR", "CREATE TABLE", ...]
account_role_name = "DEVELOPER"
on_schema {
schema_name = each.value.fully_qualified_name
}
}
If you don't maintain the list of developer users within Terraform, you can get it directly from Snowflake by checking which users have been granted the Developer role via the snowflake_grants data source.
Simplest practice: Use a separate Terraform state for pull request resources
By pull request resources, I mean resources that are specific to a particular pull request (usually containing the pull request's number somewhere), not just any resource created in a pull request. For example, a schema like dev_pr_123 for storing data for the dbt run in pull request 123. This practice is essential for keeping your PR pipeline results consistent.
The Terraform http data source can be used to retrieve the list of open merge requests and create the corresponding schemas. Here's an example with GitLab and Snowflake:
data "http" "gitlab_merge_requests" {
url = "${var.gitlab_api_url}/projects/${var.gitlab_project_id}/merge_requests?state=opened&sort=desc&per_page=100"
request_headers = {
Accept = "application/json"
Authorization = "Bearer ${var.gitlab_api_token}"
}
resource "snowflake_schema" "merge_request_schemas" {
for_each = toset([for mr in jsondecode(data.http.gitlab_merge_requests.response_body) : mr.iid])
database = snowflake_database.database.name
name = "DEV_MR_${each.key}"
}
Note that you want these resources to be in a separate Terraform state from your main one. If you put your merge request resources in your main state, new merge request pipelines will constantly overwrite your main state, making it painful to get any actual Terraform debugging and development done.
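One low-effort way to get that separation, assuming an S3 backend (the bucket and key names here are placeholders), is to give the pull request configuration its own root module with its own backend key:

terraform {
  backend "s3" {
    # A key separate from the main configuration's state
    bucket = "my-terraform-states"
    key    = "snowflake/pr-resources.tfstate"
    region = "us-east-1"
  }
}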
Use off-the-shelf data pipelines when possible
Data engineering pipelines are expensive to develop and maintain. Requests to the data engineering team can take weeks or even months to get done. Using off-the-shelf solutions can keep costs low and value high.
Fivetran is the best-known name in the space of extract-and-load you can just pay for. However, there is some exciting ongoing competition in this space. As of this writing, Snowflake itself has come out with a free connector for PostgreSQL, and there are more connectors by various companies popping up all the time on the Snowflake marketplace.
Being up-to-date on the off-the-shelf data connectors that are available out there can be a huge value-add and differentiator for any data engineer. Not to mention, it also gives you time to focus on more important high-level problems.
Use CI and self-hosted runners instead of an orchestrator
Historically, teams that used CI/CD still ran a separate orchestration tool. The CI pipeline deployed to the orchestration tool, which then actually did the work on a schedule.
However, using a separate orchestrator introduces extra complexity:
- Pull request checks: How do we know orchestration logic for a pull request actually succeeded? We could leave the CI job spinning waiting for the orchestrator, but then we're wasting compute on just spinning and waiting. We could use a service account that approves pull requests, but that's complex to set up and debug.
- Access and learning curve: The separate orchestrator requires access provisioning. Paid solutions charge per seat. Debugging requires folks to jump between CI and the orchestrator.
- Reproducibility: If your code is tied to an orchestrator, it may be difficult for you and others to identify and reproduce issues. For example, suppose you're consuming an API from a business partner, and there's an issue. Is the issue with the API or with the orchestration? You could get stuck arguing about it back-and-forth since the business partner won't want to install your orchestrator to reproduce the issue.
Prerequisite: Self-host your CI compute
Compute time for CI tools is notoriously expensive. Compare GitLab CI's $0.60/hour to AWS EC2's $0.05/hour (this is further exacerbated by the fact that GitLab charges for the time of each job separately while EC2 can execute multiple jobs in one instance). Luckily most major CI platforms provide a way to self-host that compute:
- GitLab CI self-managed runners
- GitHub Actions self-hosted runners
- Azure DevOps self-hosted agents
Simplest practice: Use CI instead of an orchestrator
In recent years, CI tools have steadily adopted more and more features from orchestrators, making it completely viable (assuming you self-host the compute - see above) to run a sophisticated data pipeline directly from your CI tool of choice.
Running pipelines on a schedule:
- GitLab CI scheduled pipelines
- GitHub Actions schedule event
- Azure DevOps pipeline schedules
Excluding certain jobs (e.g. Terraform) from the scheduled run:
- GitLab CI rules:if checking whether CI_PIPELINE_SOURCE is equal to schedule
- GitHub Actions if checking whether github.event_name is equal to schedule
- Azure DevOps pipeline conditions with ne(variables['Build.Reason'], 'Schedule')
Running multiple copies of a job in parallel:
- GitLab CI parallel runs
- GitHub Actions matrix strategy
- Azure DevOps matrix strategy
Triggering downstream pipelines:
- GitLab CI pipeline trigger API
- GitHub Actions triggering a workflow from a workflow
- Azure DevOps pipeline triggers
The above building blocks should be sufficient to run almost any batch-based parallelized data ingestion job.
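To make this concrete, here's a rough GitLab CI sketch of the pattern. Stage and job names are made up, and the extract script is the one from the Python example later in this post; adjust to your stack. Terraform runs only on regular (non-scheduled) pipelines, while the ingestion job runs on the schedule and fans out in parallel:

stages:
  - infrastructure
  - ingest

terraform_apply:
  stage: infrastructure
  rules:
    # Skip Terraform on scheduled data pipeline runs
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
      when: never
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
  script:
    - terraform init
    - terraform apply -auto-approve

extract_and_load:
  stage: ingest
  rules:
    # Run only on the schedule
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  parallel:
    matrix:
      # Hypothetical API names, one parallel job per API
      - API: [transactions, customers, products]
  script:
    - python3 -m scripts.python.extract_data -a "$API" -vv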
Simplest practice: Load data in batches, avoid streaming
If you have an off-the-shelf connector that streams data into your warehouse, go ahead and use it! However, if you have to build an extract-and-load process from scratch, avoid streaming unless there's a clear use case for it.
Building and debugging streaming infrastructure is expensive. Let's take Apache Kafka as an example. It requires DevOps expertise and time to properly set up ZooKeeper, 3+ broker nodes, plus an additional Kafka Connect server. It also takes expertise to utilize the Kafka Connect API (being cautious of potential pitfalls like Kafka Connect's default buffering behavior), to write custom code that sends data to a Kafka topic, and to troubleshoot any issues.
Unless there's a clear business need that can justify both the upfront expense of standing up streaming infrastructure and the ongoing expense to maintain it, it's better to stick to batch-based extract-and-load. Batch processes can be invoked as scripts without having to worry about streaming infrastructure or streaming race conditions. This makes it possible to call them from any orchestrator or CI pipeline.
Simplest practice: Modularize your data engineering code into command-line scripts
When using an off-the-shelf data connector is not an option, we have to write our own extract-and-load code in a language like Python.
Use Python's argparse library (or the corresponding library for your language of choice) to add command line capability to your Python functions. This allows each function to be called both as a library function from other Python code and also directly from the command line. This makes your code debuggable, modular, and easy to call from a CI script.
Example Python file scripts/python/extract_data.py:
import argparse
import logging
import os


def extract_data(api_name, ...):
    # Secrets come from environment variables, not command-line arguments
    api_key = os.environ["API_KEY"]
    ...
    return data_file


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        prog="extract_data",
        description="CLI command to extract data.",
    )
    parser.add_argument(
        "-a",
        "--api",
        help="Name of the API to extract data from.",
        required=True,
    )
    parser.add_argument(
        "-v",
        "--verbose",
        help="Be verbose. Include once for INFO output, twice for DEBUG output.",
        action="count",
        default=0,
    )
    args = parser.parse_args()

    LOGGING_LEVELS = [logging.WARNING, logging.INFO, logging.DEBUG]
    logging.basicConfig(level=LOGGING_LEVELS[min(args.verbose, len(LOGGING_LEVELS) - 1)])  # cap at DEBUG

    data_file = extract_data(api_name=args.api)
    print(data_file)
Example invocation:
$ export API_KEY='...'
$ python3 -m scripts.python.extract_data -a transactions -vv
This kind of script can be called directly in a CI job, used for easy debugging from the terminal, and shared with other teams and business partners.
I've found a ton of value in being able to save or send a single command-line snippet for reproducing a problem. Without this ability, I've had to gut and rewrite my Python functions to debug, which sometimes introduced new bugs of its own and was very difficult to explain to others, or even to understand myself after a few months.
Containerize your data pipelines
Containerization has exploded since the early 2010s. It's been argued that containers are used in many areas where they don't necessarily make sense, and they have their own overhead and learning curve, so using containers isn't always the simplest practice in every situation.
I do believe that using containers makes a ton of sense for data pipelines. You can use the same image to develop and run the pipeline, preventing "it works on my machine" issues. You can test different variations of the image without having to stand up additional infrastructure or potentially break the workflows of others who are using the same infrastructure. Finally, knowledge of containerization is increasingly expected of all engineers, while knowledge of other tools that solve similar issues (like Vagrant or Ansible) is less common.
Simplest practice: Use the same Dockerfile for development and production
If you use different Dockerfiles for developing (e.g. in VS Code Dev Containers or GitHub Codespaces or Gitpod) and for production runs, the Dockerfiles inevitably end up diverging, causing unexpected bugs. At the same time, if your development and production Docker images are identical, your production image will be bloated by tools that are needed only for development.
The solution is to use the same Dockerfile to build two different images. We can achieve this by using a Docker build argument IN_CI.
FROM python:slim

ARG IN_CI=false

# Install apt packages
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        sudo \
    && ( \
        if [ "$IN_CI" = 'false' ]; then \
            apt-get install -y --no-install-recommends \
                git \
                less \
                wget \
            ; fi \
    ) \
    # Clean up
    && apt-get clean -y \
    && rm -rf /var/lib/apt/lists/*
The build argument IN_CI is set to false by default, which installs the development dependencies git, less, and wget. When building the image in our CI pipeline, we pass in --build-arg IN_CI=true, which skips installing those development dependencies, keeping the production image slim.
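For example, assuming the Dockerfile above sits at the repository root (the image name here is just a placeholder), the two builds look like:

# Local development image (includes git, less, and wget)
$ docker build -t data-pipeline:dev .

# Slim production image built in CI
$ docker build --build-arg IN_CI=true -t data-pipeline:ci .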
Simplest practice: Use pipx for command-line Python tools
If you're using Python to write your data engineering pipelines while also using a Python-based command line tool like dbt, you may have noticed a frustrating thing: Python command line tools can have a lot of dependencies, some of them potentially conflicting with the versions of data engineering packages you want to use.
You can use isolated Python environments like venv or conda. If you do this, you'll have to manage your environments yourself, and also constantly switch between them to run your data engineering code vs dbt.
On the other hand, pipx lets you keep your root Python environment for data engineering and install command-line tools like dbt into isolated environments automagically. For example, to install the dbt command line tool for Snowflake, run:
pipx install --include-deps dbt-snowflake
This installs the package dbt-snowflake into an isolated environment that won't conflict with your data engineering packages, while still exposing the dbt command-line tool for you to use regardless of which Python environment you're in.
Note that in the example above, the package dbt-snowflake doesn't contain the dbt command line tool, but its dependency dbt-core does, which is why we had to use the flag --include-deps. See the pipx docs for more information.
Now what if you need to install multiple command line tools that also need to be in the same environment?
In the case of Elementary, dbt-snowflake is actually already a dependency of elementary-data[snowflake], so the following will install both in the same environment:
pipx install --include-deps elementary-data[snowflake]
Otherwise, you can use pipx inject to inject one package into another's environment. See the pipx docs for more information.
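For example, to add a second command-line tool to the environment pipx created for dbt-snowflake (sqlfluff here is purely an illustration):

$ pipx inject --include-apps dbt-snowflake sqlfluff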
Simplest practice: Freeze your Python and pipx dependencies
Freezing dependencies is not the simplest in terms of moving pieces, but it's definitely the simplest in terms of minimizing the effort spent debugging outages because someone, somewhere in the 10+ layers of Python dependencies in your stack, decided to upgrade their package on a weekend and break everything downstream.
Tools like Poetry aim to fix this problem, but vanilla pip can do just fine.
Suppose you have a file requirements.txt that contains your Python dependencies. First, install them locally:
$ pip install -r requirements.txt
Then freeze the exact versions of all your packages with pip freeze:
$ pip freeze > requirements-frozen.txt
Finally, in your image build process, install the frozen requirements instead of the source requirements:
$ pip install --no-deps -r requirements-frozen.txt
Note that you'll have to manually update requirements-frozen.txt every time you change or upgrade packages in requirements.txt; it won't happen automatically!
Freezing requirements in pipx works similarly. We first freeze the dependencies, but then provide them as constraints, not directly. For example, if you've created your dbt/Elementary environment locally with pipx install --include-deps elementary-data[snowflake], you can create a constraints file like so:
$ pipx runpip elementary-data freeze > "pipx-dbt-constraints.txt"
Then in your image build process, provide the constraints to your pipx install command:
$ pipx install --include-deps elementary-data[snowflake] --pip-args='--constraint pipx-dbt-constraints.txt'
Now you won't have to worry about a Python package update borking your pipeline again!
Use compressed CSVs for loading raw data into Snowflake
Since the 2000s, many data serialization formats have been developed that promise superior compression and performance. Snowflake supports the big three Apache formats: Avro, ORC, and Parquet. We also can't forget about Google's Protobuf that started it all. While using these modern formats has been touted as a best practice for data engineering, how good are they really? Suppose you need to do some debugging on a corrupt Parquet file or an Avro file with missing data. Even when I was using these formats daily, I wouldn't have been able to do this kind of debugging without a lot of research and custom work.
Instead, dump the rawest form of the data you're loading into a JSON string inside a gzip-compressed CSV. The file sizes and performance are mostly on par with the fanciest formats above. And if you ever need to troubleshoot the resulting file, you can just use the zcat utility that comes preinstalled on most Linux distributions to peek inside:
$ zcat data/data-file.csv.gz | less
Simplest practice: Export raw data into compressed CSVs
Python example:
import csv
import gzip
import json
from pathlib import Path

import pytz


def extract_data(...):
    data_file = Path("data") / "data-2024.csv.gz"
    with gzip.open(data_file, "wt") as gzip_file:
        csv_writer = csv.writer(gzip_file)
        # Write header
        csv_writer.writerow(["value", "process_started_at_utc", "process_ended_at_utc"])
        for response, started_at, ended_at in get_data(...):
            csv_writer.writerow(
                [
                    # Store the raw response as a compact JSON string
                    json.dumps(response, separators=(",", ":")),
                    started_at.astimezone(pytz.utc).isoformat(),
                    ended_at.astimezone(pytz.utc).isoformat(),
                ]
            )
    return data_file
Simplest practice: Have a single utility script load any compressed CSV to Snowflake
See the complete script here: load_to_snowflake.py
For example, to reload all data for the year 2024 in a single Snowflake transaction:
$ python3 -m scripts.python.load_to_snowflake \
    --schema my_schema \
    --table my_table \
    --file data/data-2024.csv.gz \
    --delete-filter "value:start_date::date >= '2024-01-01'::date" \
    --verbose
This "delete filter" functionality enables partially incremental loads where we need to truncate and reload a part of the table. To do a pure incremental load, omit the delete filter. To do a full truncate-and-load, set the delete filter to true
or 1=1
.
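Conceptually, the delete filter corresponds to a single-transaction pattern roughly like the following Snowflake SQL (the table and column names are illustrative, not the script's actual code):

begin;

-- "Delete filter": remove only the slice of the table being reloaded
delete from my_schema.my_table
where value:start_date::date >= '2024-01-01'::date;

-- ...then insert the rows from the compressed CSV (e.g. PUT to an internal stage + COPY INTO)...

commit;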
Note that this process doesn't stage the data in S3 first. Storing the raw files somewhere like S3 is a common "best practice" that I myself have followed for quite some time, but I've never really found the value in it. All those old files that no one ever looks at just slowly rack up storage costs. If we have to go back and see what the previously loaded data looked like, we can just use Snowflake's time travel instead. So in the spirit of minimizing moving parts, I load data directly into Snowflake now.
Transform your data with dbt
dbt has taken the data industry by storm in recent years. dbt is an open-source data transformation framework that empowers users to get started right away with minimal knowledge, while also leaving a ton of options and configurations for more advanced use cases.
You can learn more about dbt by browsing its docs. You can also browse a real-life example: GitLab's dbt project.
Simplest practice: Use dbt Cloud
The company behind dbt, dbt Labs, offers dbt as a managed service called dbt Cloud. In terms of minimizing effort and minimizing moving pieces, paying for dbt Cloud is a clear winner over running the open-source dbt Core yourself.
This is an example of how simplicity isn't everything. Any data SaaS product that charges per seat is at odds with the values of openness and inclusivity: it either limits access to a select few individuals or becomes unjustifiably expensive if given to the whole org across the board. Limiting access to a circle of people makes it harder for individuals outside that circle to explore the data documentation and lineage.
In terms of pure simplicity, however, dbt Cloud is the clear choice.
Simplest practice: Empower self-service
Every hour spent empowering folks to handle their own data needs can save dozens of hours spent responding to tickets in the future. This effort also upskills the entire organization and increases its velocity.
Data science teams especially are under a lot of deadline pressure to try new things, experiment with new products, and deliver concrete financial value to the business. These teams are frequently unable to wait even a week for data/analytics engineering support. Data scientists will stand up their own infrastructure and data pipelines anyway, so you might as well empower them to do it your way.
A focus on simplicity also turns into a virtuous cycle here: the simpler your data architecture is, the easier it is to onboard other teams, and the more time everyone saves.
Simplest practice: Model wide tables on top of dimensional models
Wide tables are the most popular modern alternative to dimensional modeling. Building spreadsheet-like wide tables directly on top of your raw data gives you the benefit of having as few moving pieces as it gets. However, I believe the effort spent on long-term maintenance of such wide tables outweighs that benefit.
I agree with proponents of wide tables that presenting the final data to the end user in a user-friendly spreadsheet-like format is a good thing. I bet that every single data professional has had to present data to end users in this format more than once. In implementations I've been part of, this was even considered its own layer of the data warehouse - "denormalized layer" or "activation layer".
In my experience, there's a ton of value in considering your wide tables (or wide views) your end product, but still building these wide tables on top of facts and dimensions, for several reasons:
- Creating a conceptual structure under your wide tables makes your data warehouse more modular, flexible, and reusable, allowing you to answer similar questions easily in the future without having to build everything from scratch again.
- Having to think about what the facts are and what their grains are forces analytics engineers to understand the business processes more deeply. This turns AEs into collaborators helping the data drive business value, as opposed to code monkeys building whatever spreadsheet the end user requests.
- Using dimensional modeling as opposed to more normalized approaches like Inmon or Data Vault makes the data speak the language of the business. This enables end users to understand the underlying data structure and makes it easier for them to self-serve.
In modern data modeling, we have more flexibility and freedom than ever. Start with wide tables but don't stop there. Add concepts, structures, and processes when the benefit they promise in terms of reduced effort outweighs the costs of setting them up.
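As a sketch of what this can look like in dbt (the fct_orders, dim_customers, and dim_products models here are hypothetical), the wide table is just a thin layer joining the facts and dimensions:

-- models/activation/wide_orders.sql
select
    orders.order_id,
    orders.ordered_at,
    orders.order_total,
    customers.customer_name,
    customers.customer_segment,
    products.product_name,
    products.product_category
from {{ ref("fct_orders") }} as orders
left join {{ ref("dim_customers") }} as customers
    on orders.customer_key = customers.customer_key
left join {{ ref("dim_products") }} as products
    on orders.product_key = products.product_key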
Simplest practice: Document requirements in dbt
Documenting data models frequently gets pushed to the end of a project, and then never gets done. However, it's actually very easy to document the data model during the requirements gathering process in dbt, kickstarting your development process with a bang!
All you have to do is create a models.yml file and document everything the end user tells you in the model's description. Then as you dive deeper into the column level, you can document what the user says about each column they need as well. After you've written the code, you already have a perfectly documented model!
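Here's a minimal sketch of what that can look like while requirements gathering is still in progress (the model and column names are hypothetical):

version: 2

models:
  - name: wide_orders
    description: >
      One row per order line. Requested by the finance team to reconcile
      monthly revenue; totals must match the ERP revenue report.
    columns:
      - name: order_total
        description: Order total in USD, including tax but excluding shipping.
      - name: ordered_at
        description: Timestamp the order was placed, in UTC.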
I've had great results in taking this a step further and turning user-provided examples into automated dbt tests. It's easy to do test-driven development with dbt:
- Get an example of a requirement from the end user.
- Turn that example into a singular data test.
- Ensure the test fails since you haven't actually implemented the feature yet. You'd be surprised how often the test will inadvertently pass because of a mistake in the test itself...
- Implement the feature.
- Run the test again - it should now pass.
Example of how sample desired output from an end user can be turned into a dbt singular data test:
with row_count as (
    select count(*) as cnt
    from {{ ref("my_model") }}
    where id = 'example_id'
)

select 'Not exactly 1 row' from row_count where cnt <> 1

union all

select 'Row test failed' from {{ ref("my_model") }}
where id = 'example_id'
    and not (
        column1 = 'example_value_1'
        and column2 = 'example_value_2'
        and ...
    )
Testing is another process that frequently gets pushed out to the end of the development process and then abandoned. When you follow test-driven development, your model will be perfectly tested as soon as your SQL is implemented.
In addition, following test-driven development prevents regressions. Regressions happen when implementing a new feature breaks an old feature. For example, when you rewrite your query logic to handle a new edge case, you inadvertently break the base case without realizing. Regressions can take dozens of hours to identify and debug, but with test-driven development your previous tests will instantly identify it for you.
Monitor your data with dbt Cloud
Nothing frustrates data consumers more than when the same data issues occur over and over and over again, and it's the data consumers catching them instead of the data team. The purpose of testing and alerting is to build trust with the data consumers.
Users of dbt Cloud can configure email and Slack alerts on failed jobs, which is all you really need. If you're using dbt Core, you can use the open-source tool Elementary to send alerts instead.
Simplest practice: Alert only on past issues
The ideal of data warehouse alerting is proactively catching and fixing all data pipeline issues before the downstream data consumers encounter them even once, but I don't believe this ideal is even remotely achievable. The biggest and nastiest data pipeline failures are the ones that leave you wondering how you could even test for them. Beautiful detailed plans containing SLAs, severity levels, and testing plans get drawn up and put on the back-burner, but they wouldn't catch many of these big issues even if they were perfectly implemented.
Put the following block in your dbt_project.yml to make all tests only warn by default.
tests:
  +severity: warn
After a downstream data consumer alerts you about a data issue and you fix it, then and only then create a test for that particular issue and set its severity to error.
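For example, a generic test added after a real incident can be bumped to error severity right in its YAML config (the model, column, and test here are hypothetical):

models:
  - name: wide_orders
    columns:
      - name: order_total
        tests:
          - not_null:
              config:
                severity: error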
For some particularly nasty failures, you may even have to go outside of dbt and implement alerting in an external system or a Python script. Do whatever you have to do to make sure that you'll catch the same issue before the data consumer does next time. Don't worry too much about preserving the consistency of some imaginary testing or alerting strategy. Alerts don't have to be pretty, they just have to work.
Just because you don't have some utopian system that can detect any issue perfectly doesn't mean you'll lose your data consumers' trust. What will lose their trust is if they have to alert you about the same exact issue over and over. As long as you show them that they only have to point out an issue once for you to catch it yourself in the future, their trust will only grow over time.