This is the second part of a series about designing metrics for event-driven systems. You can check the first part of this series.
Prometheus is an open-source system for monitoring and alerting. It is a CNCF (Cloud Native Computing Foundation) project and one of the most popular monitoring systems. You could call it the de facto standard for monitoring in Kubernetes.
To design metrics with Prometheus, you need to understand its metric types. In this article, I'll explain Prometheus metric types and how to design metrics with them.
Prometheus Metric Types
All data in Prometheus is stored as time series. A time series is a sequence of data points indexed by time. Each data point is a tuple of a timestamp and a value.
A time series is uniquely identified by its metric name and an optional set of key-value pairs called labels. Labels are the basis of Prometheus's dimensional data model.
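To make the identity rule concrete, here is a minimal sketch in plain Python. It is a toy model, not how Prometheus stores data internally, and the metric name `events_processed_total` and the `topic` label are hypothetical names I picked for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# A series is identified by its metric name plus its label set.
SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]
# Each data point is a (timestamp, value) tuple.
Sample = Tuple[float, float]


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Build a hashable identity from a metric name and its labels."""
    return name, frozenset(labels.items())


storage: Dict[SeriesKey, List[Sample]] = {}


def append(name: str, labels: Dict[str, str], timestamp: float, value: float) -> None:
    """Append one data point to the series identified by name + labels."""
    storage.setdefault(series_key(name, labels), []).append((timestamp, value))


# Same name, different label values: two distinct time series.
append("events_processed_total", {"topic": "orders"}, 1.0, 10.0)
append("events_processed_total", {"topic": "payments"}, 1.0, 4.0)
# Same name and same labels: a new data point in an existing series.
append("events_processed_total", {"topic": "orders"}, 2.0, 15.0)
```

After these three calls the toy storage holds two series, and the `orders` series holds two data points.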
Prometheus has four metric types: Counter, Gauge, Histogram, and Summary.
Counter
For values that only increase over time, apart from resetting to zero on a restart, we use Counter. It is a cumulative metric that represents a single monotonically increasing value.
Some examples are the number of requests served, number of errors, number of bytes received, and so on. Use it for values that can only increase.
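Because a Counter only goes up, a drop between two scrapes can only mean a restart. The sketch below shows that idea in plain Python. It is a simplification: the real Prometheus functions for counters also extrapolate over the time range, which I leave out here. The sample values are made up.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across scraped counter values, tolerating resets.

    A drop between two consecutive samples is treated as a restart:
    the counter began again from zero, so the whole new value counts
    as increase.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # reset detected
    return total


# Hypothetical scrapes of a counter; the process restarts after the third one.
scrapes = [100.0, 120.0, 150.0, 5.0, 30.0]
increase = counter_increase(scrapes)  # 20 + 30 + 5 + 25
```

This is why the absolute value of a Counter is rarely interesting. What you usually look at is how fast it grows.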
Gauge
If you have a single numerical value that can arbitrarily go up and down, you should use Gauge.
Some examples are temperature, current memory usage, current CPU usage, and so on. Use it for values that can increase and decrease.
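For contrast with Counter, here is a toy Gauge in plain Python. It is only a sketch of the semantics, not a client library class, and the "events in flight" example is a hypothetical one I chose.

```python
class Gauge:
    """Toy gauge: a single value that can be set, raised, or lowered."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


# Hypothetical example: events currently being processed by a consumer.
in_flight = Gauge()
in_flight.inc()  # an event starts processing
in_flight.inc()  # another one starts
in_flight.dec()  # one finishes
```

Unlike a Counter, the current value of a Gauge is meaningful on its own.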
Histogram
A Histogram represents the distribution of a set of observed values. It counts observations in predefined buckets, and the buckets are cumulative: each one counts every observation less than or equal to its upper bound.
It also provides the sum of all observed values.
A Histogram named <basename> exposes multiple time series:
- a counter of observed events: the number of observations, exposed as <basename>_count
- the total sum of observed values: the sum of all observations, exposed as <basename>_sum
- cumulative counters for each bucket: the number of observations per bucket, exposed as <basename>_bucket{le="<upper inclusive bound>"}
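The cumulative buckets are the part that usually confuses people, so here is a small plain-Python sketch of how one observation updates the three kinds of series. It is a toy model, not the client library implementation, and the bucket bounds and durations are made-up values.

```python
import math
from typing import Dict, Sequence


class Histogram:
    """Toy histogram with cumulative buckets, a sum, and a count."""

    def __init__(self, upper_bounds: Sequence[float]) -> None:
        # The +Inf bucket always exists and ends up equal to the count.
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.buckets: Dict[float, int] = {bound: 0 for bound in self.bounds}
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        self.sum += value
        self.count += 1
        # Cumulative: every bucket whose bound is >= value is incremented.
        for bound in self.bounds:
            if value <= bound:
                self.buckets[bound] += 1


# Hypothetical processing durations in seconds.
durations = Histogram(upper_bounds=[0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.4, 2.0):
    durations.observe(seconds)
```

After these four observations the `le="0.5"` bucket holds 3, because it also includes the observation counted in `le="0.1"`, and the `+Inf` bucket holds 4, the same as the count.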
Summary
A Summary is similar to a Histogram: it also samples observations and tracks their count and sum. Instead of buckets, it calculates configurable quantiles over a sliding time window. It exposes:
- streaming f-quantiles (0 ≤ f ≤ 1) over a sliding time window, exposed as <basename>{quantile="<f>"}
- the total sum: the sum of all observations, exposed as <basename>_sum
- a counter of all observed events: the number of observations, exposed as <basename>_count
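Here is a rough plain-Python sketch of the Summary idea: the sum and count cover the whole lifetime, while quantiles only see recent observations. I am assuming a simple nearest-rank quantile over stored values to keep it short. Real client libraries use streaming approximations instead of keeping every observation, so treat this as an illustration only.

```python
import math
from collections import deque
from typing import Deque, Tuple


class Summary:
    """Toy summary: lifetime sum and count, quantiles over a recent window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self.window: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float, now: float) -> None:
        self.sum += value
        self.count += 1
        self.window.append((now, value))

    def quantile(self, f: float, now: float) -> float:
        """Nearest-rank f-quantile of observations still inside the window."""
        while self.window and self.window[0][0] <= now - self.window_seconds:
            self.window.popleft()
        values = sorted(value for _, value in self.window)
        if not values:
            return math.nan
        rank = max(1, math.ceil(f * len(values)))
        return values[rank - 1]


# Hypothetical latencies, one observation per second.
latency = Summary(window_seconds=60.0)
for second, value in enumerate([0.1, 0.2, 0.3, 0.4, 5.0]):
    latency.observe(value, now=float(second))
median = latency.quantile(0.5, now=4.0)
```

Once old observations slide out of the window they stop influencing the quantiles, but they stay in the sum and the count.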
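There are subtle differences between Histogram and Summary, and the Prometheus documentation describes them in detail. The main one is that Histogram quantiles are calculated on the Prometheus server with the histogram_quantile() function and can be aggregated across instances, while Summary quantiles are calculated in the client and cannot be aggregated.
In most cases, you would use Histogram.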
Labels
As I already mentioned, labels are the basis of Prometheus's dimensional data model. They are key-value pairs attached to a metric. Whenever a metric is observed, we attach labels to it. They carry additional information about the observation. Later we use them for grouping and filtering.
It is important not to use labels for values drawn from an unbounded set, such as user IDs or timestamps. Each label should have a finite number of possible values, and that number should be small. The smaller, the better.
Also try to keep the number of labels small. Every unique combination of label values creates a separate time series, so a large number of labels or label values leads to high cardinality. High cardinality is a problem because it increases memory usage and slows Prometheus down.
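A quick back-of-the-envelope calculation shows how fast this grows. The sketch below computes the upper bound on the number of time series for a single metric. The label names and values are hypothetical.

```python
from math import prod
from typing import Dict, List


def max_series(label_values: Dict[str, List[str]]) -> int:
    """Upper bound on time series for one metric: product of value counts."""
    return prod(len(values) for values in label_values.values())


# Hypothetical labels for an event-processing counter.
bounded = {
    "topic": ["orders", "payments", "shipments"],
    "status": ["success", "error"],
}
# Adding an unbounded label such as a user ID multiplies the total.
unbounded = dict(bounded, user_id=[f"user-{n}" for n in range(10_000)])

bounded_series = max_series(bounded)      # 3 * 2
unbounded_series = max_series(unbounded)  # 3 * 2 * 10_000
```

Two small labels give at most 6 series. One extra label with 10,000 values turns that into 60,000, and for a Histogram every bucket multiplies the number again.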
The RED Method and Prometheus Metric Types
The RED method defines three key metrics for every service:
- Request Rate
- Request Error Rate
- Request Duration
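As a rough illustration of how these three metrics relate to the metric types above, here is a plain-Python sketch. It assumes two Counters (requests and errors) and one Histogram (duration), and it assumes no counter reset between the two scrapes. The `Scrape` type, its field names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Scrape:
    """Hypothetical snapshot of the relevant series at one point in time."""
    timestamp: float        # seconds
    requests_total: float   # Counter
    errors_total: float     # Counter
    duration_sum: float     # Histogram <basename>_sum, in seconds
    duration_count: float   # Histogram <basename>_count


def red(first: Scrape, last: Scrape) -> Dict[str, float]:
    """Request rate, error rate, and mean duration between two scrapes."""
    elapsed = last.timestamp - first.timestamp
    requests = last.requests_total - first.requests_total
    errors = last.errors_total - first.errors_total
    observed = last.duration_count - first.duration_count
    spent = last.duration_sum - first.duration_sum
    mean_duration = spent / observed if observed else 0.0
    return {
        "request_rate": requests / elapsed,  # requests per second
        "error_rate": errors / elapsed,      # errors per second
        "mean_duration": mean_duration,      # seconds per request
    }


result = red(
    Scrape(0.0, 1000.0, 10.0, 200.0, 1000.0),
    Scrape(60.0, 1600.0, 40.0, 350.0, 1600.0),
)
```

With these made-up numbers the service handled 10 requests per second, 0.5 errors per second, and spent 0.25 seconds per request on average. The mean is the simplest thing to derive from a Histogram; the buckets are what make percentiles possible.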
If you would like to learn how to implement the RED method with Prometheus, continue reading in my blog post.