Event-driven systems currently dominate software system design. Among their defining characteristics are asynchronous actions and eventual consistency.

In traditional systems, each call produces an immediate response. In event-driven systems (EDS), the response is not immediate; it is produced when the system is ready to process the call. An EDS is not easy to debug, and its next state is hard to predict. But we can try to understand what is happening in the system by looking at the events it produces.

I’ll try to explain the motivation behind metrics, the key metrics, and how to design metrics for an EDS, as well as some of the methodologies used in practice.

What is a Metric?

A metric is a measurement of a quantitative attribute of a system.

The system produces the metric. For an EDS we also take into account the time when the measurement happens. In the end, a metric is a triplet of name, value, and timestamp, where the name is the unique identifier of the metric.
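
To make this concrete, here is a minimal sketch of a metric sample as such a triplet in Python; the class and the example metric name are made up purely for illustration.

```python
from dataclasses import dataclass
import time

@dataclass(frozen=True)
class MetricSample:
    """A single metric measurement: name, value, and when it was taken."""
    name: str         # unique identifier, e.g. "orders_processed_total"
    value: float      # the measured quantity
    timestamp: float  # when the measurement happened (Unix seconds)

# Record a measurement at the moment it is taken.
sample = MetricSample("orders_processed_total", 42.0, time.time())
print(sample)
```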

What can I do with metrics?

In essence, metrics are for decision-making, for example:

  • data processing is slow: refactor and improve performance
  • there are too many calls to the system: scale up
  • there are too few calls to the system: scale down
  • some features are used more than others: improve them
  • some features are not used at all: remove them
  • too many errors: let’s call someone to fix it

I think these sound familiar.

Looking at metrics over time gives us the ability to detect and predict problems in the system, which is crucial for further development and maintenance. The day-to-day use of metrics is monitoring and alerting, but the same metrics can also serve resource planning, incident analysis, SLOs and SLIs, and more.

Metric design

The metric type and what needs to be measured depend on the system component. An EDS is composed of multiple components of different natures, and each of those components has characteristics that will drive the metric design.

For example, a web server facing the users will have different metrics than an event streaming platform, like Apache Kafka, or a database.

The metrics for the web server will target the user experience, like request duration, request rate, request size, error rate, etc.
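
As an illustration, here is a minimal sketch of such web server metrics using the Python prometheus_client library; the handler, metric names, and label values are hypothetical and only show the instrumentation pattern.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Request rate and error rate can both be derived from one counter labeled by status.
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["method", "path", "status"]
)
# Request duration is recorded as a histogram so percentiles can be derived later.
DURATION = Histogram(
    "http_request_duration_seconds", "HTTP request duration", ["method", "path"]
)

def handle_request(method: str, path: str) -> int:
    """Hypothetical request handler instrumented with the metrics above."""
    with DURATION.labels(method, path).time():
        status = 200  # ...real request handling would happen here...
    REQUESTS.labels(method, path, str(status)).inc()
    return status

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    handle_request("GET", "/orders")  # a real service would keep serving requests
```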

Metrics for the event streaming platforms, on the other hand, will target the system health, like the number of messages in a topic, message size, the number of received messages, the number of sent messages, and so on.

Databases will have different metrics, like the number of queries, query duration, number of rows returned, errors, etc.

All of the components mentioned above can also expose system health metrics, like CPU usage, memory usage, disk usage, network usage, etc. These metrics are common to all of them.
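
A minimal sketch of these common health metrics, assuming the Python psutil and prometheus_client libraries; the port and the 15-second sampling interval are arbitrary choices for the example.

```python
import time

import psutil
from prometheus_client import Gauge, start_http_server

CPU_USAGE = Gauge("system_cpu_usage_percent", "CPU usage in percent")
MEMORY_USAGE = Gauge("system_memory_usage_percent", "Memory usage in percent")
DISK_USAGE = Gauge("system_disk_usage_percent", "Disk usage of / in percent")

if __name__ == "__main__":
    start_http_server(8001)  # expose /metrics for scraping
    while True:
        CPU_USAGE.set(psutil.cpu_percent(interval=None))
        MEMORY_USAGE.set(psutil.virtual_memory().percent)
        DISK_USAGE.set(psutil.disk_usage("/").percent)
        time.sleep(15)  # sampling interval, picked arbitrarily for the sketch
```

In practice, an off-the-shelf exporter such as Prometheus’s node_exporter usually provides these host-level metrics out of the box.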

Metric methodologies

Taking all of this into account, we can say that the key metrics are:

  • Latency, or duration - distribution of time it takes to complete an action
  • Traffic, or rate - distribution of the number of actions per unit of time
  • Errors - distribution of the number of errors per unit of time
  • Saturation - the resource use level

The RED method

The RED method by Weaveworks is a set of rules for designing metrics for user-facing systems. The RED method asks three questions:

  • What is the rate of requests?
  • What is the rate of errors?
  • What is the request duration?

Or three key metrics:

  • Request Rate
  • Request Error rate
  • Request Duration

It is simple and easy to understand, it is easy to implement, and it fits nicely into a microservices architecture. With industry-standard tools like Prometheus and Grafana, alerts and dashboards, once created, can be reused for all services.

While the RED method is simple, IMHO it has one catch. The error rate is easy to measure, but it can hide the fact that there are no calls at all. For example, if a service has an error rate of 0%, that does not mean the service is healthy; it could be that the service is not used at all. No calls mean no errors. Because of that, you need to watch the request rate as well.

It would be better to replace the error rate with the success rate, so the question becomes: what is the rate of successful requests? The success rate fits better because it exposes the fact that there are no calls, and it also fits nicely into SLOs and SLIs. Just do not calculate it as 100% minus the error rate.
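
A toy sketch of the difference: computing the success rate directly from successful and total calls makes the no-traffic case visible, while deriving it as 100% minus the error rate does not. The function names here are made up.

```python
from typing import Optional

def success_rate(successful_calls: int, total_calls: int) -> Optional[float]:
    """Success rate computed directly; returns None when there were no calls."""
    if total_calls == 0:
        return None  # no traffic is visible instead of looking like 100% healthy
    return successful_calls / total_calls

def success_rate_from_errors(error_calls: int, total_calls: int) -> float:
    """The misleading shortcut: 100% minus the error rate hides the no-traffic case."""
    error_rate = error_calls / total_calls if total_calls else 0.0
    return 1.0 - error_rate

print(success_rate(0, 0))              # None -> the service received no calls
print(success_rate_from_errors(0, 0))  # 1.0  -> looks perfectly healthy
```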

The USE method

The USE method by Brendan Gregg is a set of rules for designing metrics for systems not exposed to the users, like databases, message brokers, streaming platforms, etc. It is also easy to implement with industry-standard tools like Prometheus and Grafana.

The USE method is based on three questions:

  • What is the utilization of the resource?
  • What is the rate of errors?
  • What is the saturation of the resource?

Its key metrics are:

  • Utilization - the level to which a resource has been used; it can be expressed as a percentage from 0% to 100%
  • Errors - distribution of the number of errors per unit of time
  • Saturation - the level to which a resource has extra work that it cannot handle, so the work is waiting in a queue

Notice that in this case, saturation has a different definition than before. The initial definition of saturation was the level to which a resource has been used, which is like the utilization metric in the USE method. Keep this in mind when applying the method.

Intuitively, saturation applies to resources like CPU, memory, disk, and network, but it can be applied to the whole system as well. For example, a service that processes events from a queue is saturated when events are piling up in the queue; then the saturation is above 100%.
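
As a hedged sketch of that idea, the saturation of such a consumer can be expressed as the ratio of the event arrival rate to the rate the service can actually process; a value above 1.0 (100%) means the backlog is growing. The numbers are made up.

```python
def queue_saturation(arrival_rate: float, processing_rate: float) -> float:
    """Saturation of an event consumer: above 1.0 means events are piling up."""
    if processing_rate == 0:
        return float("inf")  # nothing is being processed at all
    return arrival_rate / processing_rate

# Example: 1200 events/s arriving while the consumer handles 1000 events/s.
print(queue_saturation(1200.0, 1000.0))  # 1.2 -> 120% saturated, backlog grows
```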

The Four Golden Signals method

The Four Golden Signals by Google is a set of rules for designing metrics based on four questions:

  • What is the service’s latency?
  • What is the service’s traffic?
  • What is the error rate of the service?
  • What is the saturation of the service?

From these questions, we can define four key metrics:

  • Latency - distribution of time it takes to complete an action
  • Traffic - distribution of the number of actions per unit of time
  • Errors - distribution of the number of errors per unit of time
  • Saturation - the level to which a resource has been used

It is like the RED method plus saturation. Saturation is a good addition because every system has some resource that can become saturated. For example, even if a service could serve more than double the current number of requests, it can be limited by CPU saturation.

The Four Golden Signals method is aimed at user-facing services, but because it includes a saturation metric, it can be used for non-user-facing services as well. It is a good compromise when you choose to use only one method.

IMHO, the error metric deserves the same comment as in the RED method: it is easy to measure, but it can hide the fact that there are no calls.

Conclusion

I hope this article helps you understand metrics. Based on the system you are designing, you should be able to choose the right method for it.

Next time I’ll look at Prometheus metric types and try to explain them. Any feedback is welcome. Thank you for reading.

References