Have you tried building tests or ML models to detect anomalies in your data warehouse? Turns out it’s ridiculously hard!
For example, say you have a table storing ingested 3rd-party data, which should keep getting new rows every day. You want to know if it stops growing, but how long should you wait for new data to be loaded before you get an alert? It’s a delicate balance. You definitely want to know as soon as possible that data aren’t loading, but you also don’t want to cause panic every time there’s a small delay.
But maybe you have to try this—despite the headache—because an API change caused your data to stop loading, and nobody knew for a week. Or a transformation job failed to run on schedule. Or maybe an application broke, nights were spent backfilling data, and you swore to never let it happen again.
Whatever the reason, this is how you can set up adaptive flatline alerts for proactive monitoring in your data systems.
Step function modeling
Let’s say you’ve set up a test to check the count of rows in an important table with ingested data. Your test probably has a few characteristics:
- You run it on a regular interval (maybe hourly)
- You usually get the same value over and over when the table isn’t changing, but then it jumps to a new plateau after an insert (maybe daily)
- There is some variation in the size and timing of the table changes
Because of these characteristics, your row counts follow an irregular step function, which means your model needs to account for both the increases and the flatlines. So when there's a change, your model checks that it's not too big, and when there's a plateau, it checks that it's not lasting too long.
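To make the two checks concrete, here's a minimal sketch in Python. This is not Metaplane's actual algorithm; the function names and the fixed `max_increase` and `max_flatline` thresholds are illustrative assumptions (in practice you'd learn these from history):

```python
from datetime import datetime, timedelta

def check_row_counts(samples, max_increase, max_flatline):
    """Scan (timestamp, row_count) samples in order and report anomalies.

    samples      -- list of (datetime, int) pairs, sampled e.g. hourly
    max_increase -- largest acceptable jump between samples (illustrative)
    max_flatline -- longest acceptable plateau, a timedelta (illustrative)
    """
    alerts = []
    last_change_at = samples[0][0]
    for (prev_t, prev_n), (t, n) in zip(samples, samples[1:]):
        delta = n - prev_n
        if delta > 0:
            if delta > max_increase:             # increase check: jump too big
                alerts.append((t, f"increase of {delta} exceeds bound"))
            last_change_at = t
        elif t - last_change_at > max_flatline:  # plateau check: flat too long
            alerts.append((t, "flatline exceeded buffer"))
            last_change_at = t                   # reset to avoid repeat alerts
    return alerts

# Hourly samples with a daily insert of 10 rows, then a long flatline
base = datetime(2023, 11, 25)
samples = [(base + timedelta(hours=h), 100 + 10 * min(h // 24, 2))
           for h in range(96)]
alerts = check_row_counts(samples, max_increase=50,
                          max_flatline=timedelta(hours=26))
```

With these made-up numbers, the inserts at hours 24 and 48 pass the increase check, and the plateau after hour 48 eventually trips the flatline check.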
At this point, your model might look like the graph below. The blue lines are your row count values sampled at various points throughout the day, and the orange bounds are the acceptable ranges for the increases.
Notice how on November 27th, on the right side of the graph, the row count values flatline? This is where the logic needs to kick in to say “Hey, this table usually gets a daily insert but nothing’s happening—help!”
Flatline buffer
But maybe you don’t want to get an alert immediately after the 24th hour with no increase. Maybe the insert and the test run at the same time, and the test just happens to run before the insert is complete. Or maybe there’s some variance in the timing of the insert. Either way, you probably want the model to wait at least one hour to confirm that there’s really a problem before an alert yanks you out of your flow.
So, you establish some buffer time, either manually or with a model that learns the right amount of buffer to set. Because each table behaves differently and has different levels of importance for your data pipelines, you’ll want different buffers per table. Below is an example of a trend-based buffer, which uses the average historical trend (orange line) to determine how soon to alert:
See how the row count value (blue) has stayed flat but the model (orange line) is projecting that the row count should have increased after 24 hours? This is where you can look to determine how much buffer you want. As long as the value stays within the prediction bounds (orange area), no alert will trigger.
At Metaplane, we have two options for you to tune this buffer:
- Sensitivity: Increasing sensitivity raises the lower bound, so the alert fires sooner.
- Mark as normal: If the model alerts too soon, you can mark the flatline as normal with a click, and the model will learn to wait longer during future flatlines.
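To see how a trend-based buffer and a sensitivity knob could interact, here's a toy sketch. It is not Metaplane's model: the linear projection from average daily growth, the buffer of "one average insert," and the way `sensitivity` shrinks that buffer are all simplifying assumptions for illustration:

```python
def trend_lower_bound(history, hours_elapsed, sensitivity=1.0):
    """Project expected growth from the average historical trend and
    return the minimum acceptable growth since the last increase.

    history       -- past daily row-count increases (illustrative input)
    hours_elapsed -- hours since the last observed increase
    sensitivity   -- higher values shrink the buffer, alerting sooner
    """
    avg_daily_increase = sum(history) / len(history)
    hourly_rate = avg_daily_increase / 24
    projected_growth = hourly_rate * hours_elapsed
    # Allow a buffer of one average insert, shrunk as sensitivity rises.
    buffer = avg_daily_increase / sensitivity
    return projected_growth - buffer

def is_flatline_anomaly(last_count, current_count, history,
                        hours_elapsed, sensitivity=1.0):
    """True when the current count falls below the projected lower bound."""
    expected_min = last_count + trend_lower_bound(
        history, hours_elapsed, sensitivity)
    return current_count < expected_min

# Daily inserts have averaged ~10.5 rows; the count has been flat at 100.
history = [10, 12, 11, 9]
```

At the default sensitivity, a flat count only becomes anomalous once the projected growth outruns one average insert's worth of buffer (about 24 hours here); doubling the sensitivity halves the buffer, so the same flatline alerts earlier.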
Navigating the fine line between staying informed and being overwhelmed by alerts is an art, especially with high-stakes data integrity issues. So, we invite you to compare your alerting results to our ML-based anomaly detection!
Metaplane’s flatline detection algorithm learns about your data over time, tailoring itself to be more meaningful and actionable for your context. Go on, put it to the test! Create an account and get set up within 30 minutes today.