Ensuring that data consumers can use their data to reliably answer questions is of paramount importance to any data analytics team. Having a mechanism to enforce high data quality across datasets is therefore a key requirement for these teams.
Often, input data sources are missing rows, contain duplicates, or include plain invalid data. Over time, changes to business definitions or to the underlying software that produces input data can cause the meaning of columns - or even the overall structure of tables - to drift. Addressing these issues is critical to building a successful data team and generating valuable, correct insights.
In this article we explain the concept of a SQL data assertion, look at some common data quality problems, show how to detect them, and - most importantly - show how to fix them in a way that persists for all data consumers.
The SQL snippets in this post apply to Google BigQuery but can be ported easily enough to Redshift, Postgres or Snowflake data warehouses.
What is a data assertion?
A data assertion is a query that looks for problems in a dataset. If the query returns any rows then the assertion fails.
Data assertions are defined this way because it’s much easier to look for problems than for their absence. It also means that assertion queries can themselves be used to quickly inspect the data causing the assertion to fail - making it easy to diagnose and fix the problem.
Checking field values
Let’s take a look at a simple example.
Assume that there is a database.customers table containing information about customers in the database.
Some checks that we might want to verify on the table’s contents include:
- The field email_address is always set
- The field customer_type is one of “business” or “individual”
The following simple query will return any rows violating these rules:
SELECT customer_id
FROM database.customers
WHERE email_address IS NULL
  OR customer_type NOT IN ("business", "individual")
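This pattern isn’t limited to NULL and enum checks. As an illustrative sketch (not one of the original requirements above), we could also flag email addresses that don’t match a basic pattern, using BigQuery’s REGEXP_CONTAINS function:
-- Illustrative only: flags missing or malformed email addresses
SELECT customer_id
FROM database.customers
WHERE email_address IS NULL
  OR NOT REGEXP_CONTAINS(email_address, r"^[^@]+@[^@]+\.[^@]+$")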
Checking for unique fields
We may also want to run checks across more than one row. For example, we might want to verify that the customer_id field is unique. A query like the following will return any duplicate customer_id values:
SELECT
customer_id,
SUM(1) AS count
FROM database.customers
GROUP BY 1
HAVING count > 1
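The same pattern extends to uniqueness over a combination of fields. As a sketch, if the table also had a hypothetical signup_date column and we wanted each customer_id to appear at most once per date, we could group by both fields:
-- Sketch only: signup_date is a hypothetical column, not part of the example table above
SELECT
  customer_id,
  signup_date,
  SUM(1) AS count
FROM database.customers
GROUP BY 1, 2
HAVING count > 1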
Combining multiple assertions into a single query
We can combine all of the above into a single query, using UNION ALL, to quickly find any customer_id value violating one of our rules:
SELECT customer_id, "missing_email" AS reason
FROM database.customers
WHERE email_address IS NULL
UNION ALL
SELECT customer_id, "invalid_customer_type" AS reason
FROM database.customers
WHERE customer_type NOT IN ("business", "individual")
UNION ALL
-- Duplicate check: count rows per customer_id and keep those appearing more than once
SELECT customer_id, "duplicate_id" AS reason
FROM (
  SELECT customer_id, SUM(1) AS count
  FROM database.customers
  GROUP BY 1
)
WHERE count > 1
We now have one query we can run to detect any problems in our table, and we can easily add another unioned SELECT statement if we want to add new conditions in the future.
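When the combined assertion does return rows, it can also be useful to summarize how many rows fail each rule. A minimal sketch, wrapping the combined query above in a subquery (abbreviated here for brevity):
SELECT reason, SUM(1) AS failing_rows
FROM (
  -- paste the full UNION ALL assertion query from above here
  SELECT customer_id, "missing_email" AS reason
  FROM database.customers
  WHERE email_address IS NULL
  -- ... remaining UNION ALL branches ...
)
GROUP BY 1
ORDER BY failing_rows DESC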
Creating clean datasets
Now that we’ve detected the issues in our data, we need to clean them up. Ultimately, how you choose to handle data quality issues depends on your business use case.
In this example we will:
- Remove any rows that are missing the email_address field
- Set a default customer_type if it’s invalid
- Remove rows with duplicate customer_id values, retaining one row per customer_id value (we don’t care which one)
Rather than editing the dataset directly, we can create a new clean copy of the dataset - this gives us freedom to change or add rules in the future and avoids deleting any data.
The following SQL query defines a view of our database.customers table in which invalid rows are removed, default customer types are set, and duplicate rows for the same customer_id are removed:
SELECT
  customer_id,
  ANY_VALUE(email_address) AS email_address,
  -- Map any unexpected customer_type value to "unknown"
  ANY_VALUE(
    CASE customer_type
      WHEN "individual" THEN "individual"
      WHEN "business" THEN "business"
      ELSE "unknown"
    END) AS customer_type
FROM database.customers
-- Drop rows with no email address
WHERE email_address IS NOT NULL
-- Collapse duplicate customer_id values into a single row
GROUP BY 1
This query can be used to create either a view or a table in our cloud data warehouse, perhaps called database_clean.customers, which can be consumed in dashboards or by analysts who want to query the data.
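As a minimal sketch of how that might look in BigQuery, assuming a database_clean dataset already exists in the warehouse:
-- Create (or replace) the clean view on top of the raw table
CREATE OR REPLACE VIEW database_clean.customers AS
SELECT
  customer_id,
  ANY_VALUE(email_address) AS email_address,
  ANY_VALUE(
    CASE customer_type
      WHEN "individual" THEN "individual"
      WHEN "business" THEN "business"
      ELSE "unknown"
    END) AS customer_type
FROM database.customers
WHERE email_address IS NOT NULL
GROUP BY 1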
Once the clean dataset exists, we can check that it has actually fixed the problems by re-running the original assertion query against it.
Continuous data quality testing
Assertions should be run as part of any data pipeline to make sure breaking changes are picked up the moment they happen.
If an assertion returns any rows, downstream steps in the pipeline should either fail or trigger a notification to the data owner.
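One way to make a pipeline step fail hard in BigQuery is the scripting ASSERT statement. The sketch below assumes the assertion is pointed at the clean dataset; the exact failure mechanism will depend on your orchestration tooling:
-- Fails the script (and therefore the pipeline step) if any row violates the rules
ASSERT (
  SELECT COUNT(*)
  FROM database_clean.customers
  WHERE email_address IS NULL
     OR customer_type NOT IN ("business", "individual")
) = 0
AS "database_clean.customers failed its data quality assertions";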
Dataform has built-in support for data assertions, and provides a way to run them as part of a larger SQL pipeline.
These can be run at any frequency, and if an assertion fails an email will be sent to notify you of the problem. Dataform also provides a way to easily create new datasets in your warehouse, making the process of cleaning and testing your data straightforward to manage.
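As a rough sketch of the idea in Dataform’s open-source SQLX format, an assertion lives in its own file: a config block marking it as an assertion, followed by the failing-rows query (here ${ref("customers")} is assumed to resolve to the customers table declared elsewhere in the project; see the documentation below for the authoritative syntax):
config { type: "assertion" }

SELECT customer_id
FROM ${ref("customers")}
WHERE email_address IS NULL
  OR customer_type NOT IN ("business", "individual")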
For more information on how to start writing data assertions with Dataform, check out the assertions documentation guide for Dataform’s open-source framework, or create an account for free and start using Dataform's fully managed Web platform.