This is the third part of a series about designing metrics for event-driven systems. You can check the first part and the second part of this series before proceeding.
I discussed the general principles of designing metrics in the first part. In the second part, I explained Prometheus metric types and applied them using the RED method. In this article, I'll explain the USE method with Prometheus, followed by a short discussion of the Four Golden Signals and a conclusion covering all the methods.
Let's go...
The USE Method
The USE method by Brendan Gregg is a set of rules for designing metrics, used mainly for systems that are not directly exposed to users, such as databases, message brokers, and streaming platforms.
Its key metrics are:
- Utilization - the degree to which a resource is in use
- Errors - the number of errors per unit of time
- Saturation - the degree to which a resource has extra work that it cannot handle right away, so that work has to wait or be dropped
Implementation
To keep things simple and close to what we use in daily work, my example applies the USE method to CPU, memory, and network. I built the examples with docker-compose, Prometheus, and Grafana, and I use the node-exporter to get metrics from the system. The complete example is in my GitHub repo.
CPU Utilization
CPU utilization is the percentage of time the CPU is busy. The node-exporter provides the node_cpu_seconds_total metric. This metric is a counter that counts the number of seconds the CPU has spent in each mode. One of the modes is idle, which is when the CPU is not busy.
Over a period, say 1m, we take the per-second rate of change of the idle counter for each CPU and average it. The result is the fraction of time the CPU was idle. Subtracting that value from 1 gives the CPU utilization as a fraction between 0 and 1:
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))
It is the same principle as in the RED method. We use counters, observe the rate of change, and then calculate the average.
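To make the arithmetic behind the query concrete, here is a minimal Python sketch of the same calculation. It is only an illustration, not how Prometheus implements rate(): it ignores counter resets and extrapolation, and the function names and sample values are hypothetical.

```python
from statistics import mean


def counter_rate(first, last, window_seconds):
    """Per-second increase of a counter over a window (no counter-reset handling)."""
    return (last - first) / window_seconds


def cpu_utilization(idle_samples, window_seconds):
    """idle_samples maps a CPU label to (first, last) idle-counter values in seconds."""
    idle_rates = [
        counter_rate(first, last, window_seconds)
        for first, last in idle_samples.values()
    ]
    # Average idle fraction across CPUs, then invert it to get the busy fraction.
    return 1 - mean(idle_rates)


# Hypothetical idle-counter samples taken 60 seconds apart on a two-core host.
samples = {"0": (1000.0, 1054.0), "1": (2000.0, 2042.0)}
utilization = cpu_utilization(samples, 60)
print(f"{utilization:.0%}")  # prints 20%
```

Core 0 was idle for 54 of the 60 seconds and core 1 for 42, so the average idle fraction is 0.8 and the utilization is 0.2.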
If you are interested, continue to the rest on my blog.