Event-driven systems currently dominate software system design. Two of their defining characteristics are asynchronous
actions and eventual consistency.
In traditional systems, each call produces an immediate response. In event-driven systems (EDS), the response is not immediate; it is produced when
the system is ready to process the call. An EDS is hard to debug, and its next state is hard to predict. But we can try to understand what is happening in the system by looking
at the events it produces.
I'll try to explain the motivation behind metrics, the key metrics, and how to design metrics for an EDS, as well as some of the methodologies
used in practice.
What is a metric?
A metric is a measurement of a quantitative attribute of a system.
The system produces the metric. For an EDS, we also take into account the time when the measurement happens. In the end, a metric is a triplet of name,
value, and timestamp. The name is a unique identifier of the metric.
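The triplet described above can be sketched in a few lines of Python. This is only an illustration: the `Metric` class, the `record` helper, and the metric name `http.request.duration_seconds` are assumptions made for the example, not part of any particular monitoring library.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    """A metric as a (name, value, timestamp) triplet."""
    name: str         # unique identifier of the metric
    value: float      # the measured quantity
    timestamp: float  # when the measurement happened (Unix seconds)


def record(name: str, value: float, timestamp: Optional[float] = None) -> Metric:
    """Create a metric, stamping it with the current time if none is given."""
    return Metric(name, value, time.time() if timestamp is None else timestamp)


# Hypothetical metric name and values, chosen only for illustration.
sample = record("http.request.duration_seconds", 0.123, timestamp=1700000000.0)
```

Making the triplet immutable keeps a recorded measurement from being changed after the fact, which is usually what you want for data that is later aggregated over time.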
What can I do with metrics?
In essence, metrics are for decision-making, for example:
- data processing is slow: refactor and improve performance
- there are too many calls to the system: scale up
- there are too few calls to the system: scale down
- some features are used more than others: improve them
- some features are not used at all: remove them
- too many errors: let's call someone to fix it
I think these sound familiar.
Looking at the metrics over time gives us the ability to detect and predict problems in the system. It is crucial for further development and maintenance. The
day-to-day metric usage is for monitoring and alerting. The same metric can be used for resource planning, incident analysis, SLOs and SLIs, and many more.
Metric design
The metric type and what needs to be measured depend on the system component. An EDS is composed of multiple components of different natures. Each of those
components has characteristics that will drive metric design.
For example, a user-facing web server will have different metrics than an event streaming platform, like Apache Kafka, or a database.
The metrics for the web server will target the user experience, like request duration, request rate, request size, error rate, etc.
Metrics for event streaming platforms, in contrast, will target the system health, like the number of messages in the topic, message size, number of received messages,
the number of sent messages, and so on.
Databases will have different metrics, like the number of queries, query duration, number of rows returned, errors, etc.
For all mentioned components we can have metrics for the system health, like CPU usage, memory usage, disk usage, network usage, etc. These metrics are common for all.
Metric methodologies
Taking all of this together, we can say that the key metrics are:
- Latency, or duration - distribution of time it takes to complete an action
- Traffic, or rate - distribution of the number of actions per time
- Errors - distribution of the number of errors per time
- Saturation - the resource use level
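As a rough sketch of how these four metrics could be derived from raw events, the snippet below summarizes one time window. The `Event` type, the `summarize` function, and the `in_flight` / `capacity` inputs used for saturation are assumptions made for this example; a real system would normally rely on a metrics library and report a fuller latency distribution than the median and maximum.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    timestamp: float  # when the action completed (Unix seconds)
    duration: float   # how long the action took, in seconds
    ok: bool          # False if the action ended in an error


def summarize(events: List[Event], window_seconds: float,
              in_flight: int, capacity: int) -> Dict[str, float]:
    """Summarize one time window into the four key metrics."""
    durations = [e.duration for e in events]
    errors = sum(1 for e in events if not e.ok)
    return {
        # Latency: two simple points of the duration distribution
        "latency_median": median(durations) if durations else 0.0,
        "latency_max": max(durations) if durations else 0.0,
        # Traffic: actions per second in the window
        "traffic_per_second": len(events) / window_seconds,
        # Errors: failed actions per second in the window
        "errors_per_second": errors / window_seconds,
        # Saturation: share of the (assumed) capacity currently in use
        "saturation": in_flight / capacity,
    }


# Four hypothetical events observed in a 2-second window.
events = [
    Event(timestamp=0.5, duration=0.1, ok=True),
    Event(timestamp=1.0, duration=0.2, ok=True),
    Event(timestamp=1.5, duration=0.3, ok=False),
    Event(timestamp=1.9, duration=0.4, ok=True),
]
summary = summarize(events, window_seconds=2.0, in_flight=3, capacity=4)
```

Computing all four from the same window keeps them comparable: a spike in traffic can be read next to the latency and error figures for the same period.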
If you are interested in this topic, please continue reading the article on my blog.