These are my notes on Chapter 10: Practical Alerting from the book Site Reliability Engineering: How Google Runs Production Systems.
This post is part of a series. The previous post can be seen here:
This post is rather short, as the chapter is mostly about how Google uses Borgmon as a tool for monitoring and alerting. Borgmon is a monitoring system similar to Prometheus, or rather the other way around, since Prometheus was born taking inspiration from Borgmon.
The most interesting part of this chapter comes from the section Storage in the Time-Series Arena, where it gives a visual explanation of how a time-series database works under the hood.
A time-series is conceptually a one-dimensional matrix of numbers, progressing through time. As you add permutations of labels to this time-series, the matrix becomes multidimensional.
It’s super interesting! The section gives a valuable overview of how time-series databases work under the hood and of their trade-offs. It can be read on its own, without suffering from a lack of context.
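To make the matrix picture a bit more concrete, here is a minimal sketch in Python. It is not how Borgmon or Prometheus is implemented; `TinySeriesStore`, the metric name `http_requests`, and the `job` and `instance` labels are all hypothetical names I picked for illustration. Each unique combination of labels gets its own row of (timestamp, value) samples, so adding label permutations adds rows to the matrix.

```python
from collections import defaultdict


class TinySeriesStore:
    """Toy in-memory store: one list of samples per unique label set."""

    def __init__(self):
        # (metric name, frozenset of label pairs) -> [(timestamp, value), ...]
        self._series = defaultdict(list)

    def append(self, name, labels, timestamp, value):
        """Add one sample to the series identified by name + labels."""
        key = (name, frozenset(labels.items()))
        self._series[key].append((timestamp, value))

    def select(self, name, **matchers):
        """Return every series of `name` whose labels include all matchers."""
        wanted = set(matchers.items())
        return {
            key: samples
            for key, samples in self._series.items()
            if key[0] == name and wanted <= key[1]
        }


store = TinySeriesStore()
store.append("http_requests", {"job": "web", "instance": "host0:80"}, 0, 10)
store.append("http_requests", {"job": "web", "instance": "host0:80"}, 10, 25)
store.append("http_requests", {"job": "web", "instance": "host1:80"}, 0, 7)

# Two label permutations -> two rows in the "matrix".
web = store.select("http_requests", job="web")
# Narrowing by one more label picks a single row.
one = store.select("http_requests", instance="host0:80")
```

A real time-series database also has to deal with things this sketch ignores, such as bounding memory and expiring old samples.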
I would like to dive more into the topic, but it would derail the purpose of this post. Hopefully, in the near future, I can come back to write about it.
Rather than requiring management of many individual components, a large system should be designed to aggregate signals and prune outliers. We need monitoring systems that allow us to alert for high-level service objectives, but retain the granularity to inspect individual components as needed.
Ensuring that the cost of maintenance scales sublinearly with the size of the service is key to making monitoring (and all sustaining operations work) maintainable.
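As a rough illustration of that idea, here is a small Python sketch that pages on an aggregated, service-level error ratio while still keeping the per-task numbers around for inspection. The function names, the task names, and the 5% threshold are assumptions of mine, not something taken from the book or from Borgmon's rule language.

```python
def service_error_ratio(per_task):
    """Aggregate counters across all tasks into one service-level ratio."""
    total = sum(t["requests"] for t in per_task.values())
    errors = sum(t["errors"] for t in per_task.values())
    return errors / total if total else 0.0


def tasks_over(per_task, threshold):
    """Drill down: list the individual tasks whose own ratio exceeds threshold."""
    return sorted(
        name
        for name, t in per_task.items()
        if t["requests"] and t["errors"] / t["requests"] > threshold
    )


# Hypothetical counters scraped from three tasks of the same job.
counts = {
    "task-0": {"requests": 1000, "errors": 1},
    "task-1": {"requests": 1000, "errors": 2},
    "task-2": {"requests": 1000, "errors": 100},
}

THRESHOLD = 0.05  # assumed service-level objective: at most 5% errors

ratio = service_error_ratio(counts)
should_page = ratio > THRESHOLD          # alert on the aggregate only
suspects = tasks_over(counts, THRESHOLD) # granularity kept for debugging
```

With these made-up numbers, one misbehaving task is not enough to page anyone, because the service as a whole stays under the threshold, but the per-task view still points at `task-2` when someone goes looking.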
If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.
You can also follow me on Twitter and Mastodon.
Photo by Kevin Grieve on Unsplash