This is the third part of a series about designing metrics for event-driven systems. You can check the first part and the second part of this series before proceeding.

The first part covered the general principles of designing metrics. The second part explained the Prometheus metric types and applied them using the RED method. In this article, I’ll explain the USE method with Prometheus, then finish with a short discussion of the Four Golden Signals and a conclusion covering all the methods.

Let’s go…

The USE Method

The USE method by Brendan Gregg is a set of rules for designing metrics, mainly used for systems that are not directly exposed to users, like databases, message brokers, streaming platforms, etc. Its key metrics are:

  • Utilization - the level to which a resource has been used
  • Saturation - the level to which a resource has extra work that it cannot handle; the extra work has to wait or is dropped
  • Errors - the number of error events over time

Implementation

To keep things simple and close to what we use in daily work, I’ll demonstrate the USE method by observing CPU, memory, and network. The examples use docker-compose, Prometheus, and Grafana, with node-exporter collecting metrics from the system. The complete example is in my GitHub repo.

CPU Utilization

CPU utilization is the percentage of time the CPU is busy. The node-exporter provides the node_cpu_seconds_total metric, a counter of the seconds each CPU has spent in each mode. One of the modes is idle, which is when the CPU is not busy.

Over a period, say 1m, the rate of the idle counter gives the fraction of each second a CPU spent idle. Averaging that across CPUs and subtracting the result from 1 gives the CPU utilization:

1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))

It is the same principle as in the RED method. We use counters, observe the rate of change, and then calculate the average.
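To make the arithmetic behind that query concrete, here is a minimal Python sketch that is not part of the original example. It assumes two hypothetical scrapes of the idle counter taken 60 seconds apart for each core. The sample values and the helper names rate and cpu_utilization are made up for illustration. It is a simplification: it ignores counter resets and the extrapolation that Prometheus applies in rate().

```python
def rate(first, last, seconds):
    """Per-second increase of a counter between two samples (no reset handling)."""
    return (last - first) / seconds


def cpu_utilization(idle_samples, window_seconds):
    """Mimic 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[window])).

    idle_samples maps a CPU id to (first, last) idle-counter values.
    """
    idle_rates = [rate(first, last, window_seconds)
                  for first, last in idle_samples.values()]
    return 1 - sum(idle_rates) / len(idle_rates)


# Hypothetical samples: idle seconds per core at the start and end of a 60 s window.
samples = {"0": (1000.0, 1045.0), "1": (2000.0, 2030.0)}
utilization = cpu_utilization(samples, 60)
print(f"{utilization:.3f}")  # 0.375
```

Core 0 was idle for 75% of the window and core 1 for 50%, so the average idle fraction is 0.625 and the utilization is 0.375.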

CPU Saturation

The node-exporter provides three metrics for average CPU load: node_load1, node_load5, and node_load15. The number in the metric name is the number of minutes over which the average is calculated, so node_load1 is the one-minute load average. These metrics are gauges. We also need the number of CPUs in the system, which we can derive from the node_cpu_seconds_total metric:

count by (instance) (sum by (instance, cpu) (
    node_cpu_seconds_total{job="..."}
    ))

Now we can calculate the saturation:

sum by (instance) (node_load1) / count by (instance) (sum by (instance, cpu) (
    node_cpu_seconds_total{job="..."}
    ))

Elsewhere you may see it written as:

sum(node_load1) / sum(node:node_num_cpu:sum)

Here node:node_num_cpu:sum is not a metric exported by node-exporter but the result of a recording rule, so it exists only if such a rule is defined in your Prometheus setup. Recording rules pre-calculate expressions that are used in many places. You can find more about them in the Prometheus documentation.
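The CPU saturation calculation in this section boils down to dividing the load average by the number of distinct CPUs. The following Python sketch illustrates that under stated assumptions: the function name cpu_saturation, the label pairs, and the load value are hypothetical and only stand in for what Prometheus would see on one instance.

```python
def cpu_saturation(load1, cpu_series):
    """Mimic node_load1 divided by the number of distinct CPUs on one instance.

    cpu_series is a list of (cpu, mode) label pairs, one per
    node_cpu_seconds_total series scraped from that instance.
    """
    cores = len({cpu for cpu, _mode in cpu_series})
    return load1 / cores


# Hypothetical instance with 4 cores, each exposing three modes.
series = [(cpu, mode)
          for cpu in ("0", "1", "2", "3")
          for mode in ("idle", "user", "system")]
saturation = cpu_saturation(6.0, series)
print(saturation)  # 1.5
```

A value above 1 means the one-minute load average exceeds the number of cores, so some work had to wait.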

Memory Utilization

The node-exporter provides memory metrics for free memory, buffers, cache, and total. So utilization would be calculated as 1 - (free + buffers + cache) / total:

1 - sum(node_memory_MemFree_bytes{job="..."} + node_memory_Buffers_bytes{job="..."} + node_memory_Cached_bytes{job="..."}) 
/ sum(node_memory_MemTotal_bytes{job="..."})

Simple.
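As a quick sanity check of the formula, here is a small Python sketch with made-up numbers. The function name memory_utilization and the byte values are assumptions for illustration, not readings from the example setup.

```python
def memory_utilization(free, buffers, cached, total):
    """Mimic 1 - (free + buffers + cached) / total, all values in bytes."""
    return 1 - (free + buffers + cached) / total


GIB = 1024 ** 3

# Hypothetical host: 16 GiB total, 2 GiB free, 1 GiB buffers, 5 GiB cached.
mem_utilization = memory_utilization(
    free=2 * GIB, buffers=1 * GIB, cached=5 * GIB, total=16 * GIB
)
print(f"{mem_utilization:.2f}")  # 0.50
```

Half of the memory counts as used, because buffers and cache are treated as reclaimable.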

Memory Saturation

While the previous calculation was simple, this one is a bit more complicated. No metric expresses memory saturation directly. When the system runs short of memory, it can swap some of the memory to disk, so we can approximate memory saturation by observing paging activity.

The node-exporter provides node_vmstat_pgpgin and node_vmstat_pgpgout metrics, which count data paged in from and out to disk. As these metrics are counters, we can calculate the paging rate as:

1e3 * sum(
  rate(node_vmstat_pgpgin{job="..."}[1m]) +
  rate(node_vmstat_pgpgout{job="..."}[1m])
)

Not perfect, but good enough.
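The paging-rate approximation can be sketched in Python as follows. The counter samples and the names counter_rate and paging_rate are hypothetical, the 1e3 factor is simply carried over from the query, and counter resets are ignored.

```python
def counter_rate(first, last, seconds):
    """Per-second increase of a counter between two samples (no reset handling)."""
    return (last - first) / seconds


def paging_rate(pgpgin, pgpgout, window_seconds, scale=1e3):
    """Mimic scale * (rate(pgpgin) + rate(pgpgout)) for a single instance.

    pgpgin and pgpgout are (first, last) counter samples from the same window.
    """
    return scale * (counter_rate(*pgpgin, window_seconds)
                    + counter_rate(*pgpgout, window_seconds))


# Hypothetical samples taken 60 s apart.
paging = paging_rate(pgpgin=(1200.0, 1500.0), pgpgout=(300.0, 420.0),
                     window_seconds=60)
print(paging)  # 7000.0
```

A value that stays near zero suggests little paging; a sustained high value hints that memory is under pressure.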

CPU and Memory Errors

The metrics used in this example give us no way to calculate CPU and memory errors, so we skip them.

Network Utilization

In this case, we cannot calculate network utilization precisely, but we can express the load put on the network. For this, node-exporter provides node_network_receive_bytes_total and node_network_transmit_bytes_total. As there is no metric that expresses total network bandwidth, we can represent network utilization as:

sum(rate(node_network_receive_bytes_total{job="..."}[1m]) + rate(node_network_transmit_bytes_total{job="..."}[1m]))

Network Saturation

The problem is similar to the one above: it is hard to get network saturation without knowing the network capacity. The node-exporter exposes metrics for dropped packets, node_network_receive_drop_total and node_network_transmit_drop_total. We can use these two to calculate the saturation:

sum(rate(node_network_receive_drop_total[5m])) + sum(rate(node_network_transmit_drop_total[5m]))

The above expression represents the per-second rate of dropped packets, which we take as a sign of network saturation.

Network Errors

The node-exporter provides node_network_receive_errs_total and node_network_transmit_errs_total metrics expressing the errors on the network. Dividing them by the packet counters node_network_receive_packets_total and node_network_transmit_packets_total gives the error ratio when receiving and transmitting:

sum(rate(node_network_receive_errs_total[5m])) / sum(rate(node_network_receive_packets_total[5m]))

sum(rate(node_network_transmit_errs_total[5m])) / sum(rate(node_network_transmit_packets_total[5m]))
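The idea of dividing an error rate by a traffic rate can be sketched in Python like this. The function name error_ratio and the sample values are hypothetical. Because both counters are sampled over the same window, the window length cancels out and only the increases matter.

```python
def error_ratio(errors, traffic):
    """Ratio of error-counter increase to traffic-counter increase over one window.

    Both arguments are (first, last) counter samples from the same window.
    Returns None when there was no traffic, to avoid dividing by zero.
    """
    traffic_increase = traffic[1] - traffic[0]
    if traffic_increase == 0:
        return None
    return (errors[1] - errors[0]) / traffic_increase


# Hypothetical receive-side samples: 5 new errors against 10,000 units of traffic.
ratio = error_ratio(errors=(10, 15), traffic=(50_000, 60_000))
print(ratio)  # 0.0005
```

Returning None for an idle interface is a design choice of this sketch; PromQL would yield NaN for a division of zero by zero.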

The Four Golden Signals

When designing metrics for an event-driven system, you’ll obviously have to combine the RED and USE methods. For different system components, you’ll use the appropriate method.

The Four Golden Signals, described in Google's SRE book, are latency, traffic, errors, and saturation. They can be seen as a crossbreed of the RED and USE methods.

Notice that it is sometimes impossible to calculate all the metrics for a system component, and sometimes an approximation is needed, as with network utilization and saturation or memory saturation. But it is better to have some approximation than nothing.

Prometheus is a great tool, but it has its quirks. What I put here are just basic examples; my aim was to explain the principles and the concepts.

Whether you liked it or not, or if you have a comment, drop me a mail. Thank you.

Resources