Using Prometheus subquery for capturing spikes

When we work with monitoring systems, we usually want to know whether our graphs show unusual behavior, for example a brief increase in disk I/O. Such short-lived increases are called spikes. But how can we catch spikes correctly if we use Prometheus in our infrastructure?

Prometheus is a monitoring system built around a TSDB (time series database); its data can be queried by visualization tools such as Grafana. Prometheus has 4 types of metrics:

  • Gauge
  • Counter
  • Histogram
  • Summary

A gauge is a metric that represents a single numerical value that can go up and down. If we want to monitor spikes of a particular gauge metric, they can be captured pretty easily: we just need to use the max_over_time function.

max_over_time(node_load[interval])
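To make the behavior concrete, here is a minimal Python sketch of what max_over_time does for a gauge. The sample data and the helper name are hypothetical, and the window is treated as left-open, which is an assumption of this sketch rather than a statement about every Prometheus version.

```python
def max_over_time(samples, now, window):
    """Return the largest value among samples with now - window < t <= now.

    samples is a list of (timestamp_seconds, value) pairs.
    Returns None when the window contains no samples.
    """
    values = [v for t, v in samples if now - window < t <= now]
    return max(values) if values else None


# Hypothetical load samples scraped every 15 seconds; the value at t=30 is a spike.
load = [(0, 0.4), (15, 0.5), (30, 3.2), (45, 0.6), (60, 0.5)]

# The latest value (0.5) hides the spike, the windowed maximum does not.
peak = max_over_time(load, now=60, window=60)
print(peak)
```

The point of the sketch is only that the maximum over a window keeps a short spike visible even after the gauge has returned to normal.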

But if we want to monitor http_requests_total, for example, a stand-alone max_over_time function does not help. Metrics like http_requests_total are counters. A counter is a cumulative metric whose value can only increase or be reset to zero on restart, so its maximum over any window is usually just its most recent value. To see spikes we have to look at how fast the counter grows, which is what the rate function gives us. Previously, if we wanted to combine the over_time functions (avg, max, min) with rate functions, we first had to store the rate as a separate series (typically with a recording rule) so that a range vector could be selected from it, but since Prometheus 2.7.0 we can use a subquery for this purpose.
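The difference between a raw counter and its rate can be sketched in a few lines of Python. This is a simplified illustration with hypothetical data: the helper simple_rate is not Prometheus's implementation, and it skips the extrapolation to window boundaries that the real rate function performs.

```python
def simple_rate(samples):
    """Approximate per-second increase between the first and last sample.

    samples is a list of (timestamp_seconds, counter_value) pairs.
    A drop in value is treated as a counter reset, so counting restarts from zero.
    """
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        increase += value if value < prev else value - prev
        prev = value
    duration = samples[-1][0] - samples[0][0]
    return increase / duration


# Hypothetical request counter scraped every 15 seconds, with a burst between t=15 and t=30.
counter = [(0, 100), (15, 130), (30, 400), (45, 410), (60, 420)]

raw_max = max(v for _, v in counter)  # just the latest total, says nothing about the burst
rate_1m = simple_rate(counter)        # requests per second over the whole minute
print(raw_max, rate_1m)
```

Taking the maximum of the raw counter only returns the newest sample, which is why the rate has to be computed first and the maximum taken over the rates.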

The basic syntax of a subquery looks like this (according to the Prometheus documentation):

<instant_query> '[' <range> ':' [ <resolution> ] ']' [ offset <duration> ]

<instant_query> is equivalent to the query field in the /query_range API.
<range> and offset <duration> are similar to those of a range selector.
<resolution> is optional and is equivalent to step in the /query_range API; if omitted, the global evaluation interval is used.

In practice it may look like this:

max_over_time( rate(http_requests_total{status="500"}[1m]) [5m:1m] )

This query returns the maximum per-second rate of requests with status 500 over the last 5 minutes. The inner rate is computed over a 1-minute window and is evaluated at a 1-minute step, and max_over_time then picks the largest of those values.
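The evaluation of such a subquery can be approximated in Python as follows. This is a sketch under stated assumptions: the data is hypothetical, the helper names rate_at and max_rate_subquery are made up for illustration, windows are treated as closed on both ends, and the rate is computed without the boundary extrapolation that Prometheus applies.

```python
def rate_at(samples, t, window):
    """Per-second increase over samples with t - window <= ts <= t.

    Returns None when fewer than two samples fall into the window.
    """
    pts = [(ts, v) for ts, v in samples if t - window <= ts <= t]
    if len(pts) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(pts, pts[1:]):
        # A drop in value is treated as a counter reset.
        increase += cur if cur < prev else cur - prev
    return increase / (pts[-1][0] - pts[0][0])


def max_rate_subquery(samples, now, rng, step, window):
    """Mimic max_over_time(rate(x[window])[rng:step]) evaluated at time now."""
    times = range(now - rng + step, now + 1, step)
    rates = [r for r in (rate_at(samples, t, window) for t in times) if r is not None]
    return max(rates) if rates else None


# Hypothetical error counter scraped every 30 seconds, with a burst between t=120 and t=180.
errors = [(0, 0), (30, 3), (60, 6), (90, 9), (120, 12), (150, 72), (180, 132),
          (210, 135), (240, 138), (270, 141), (300, 144), (330, 147), (360, 150)]

peak = max_rate_subquery(errors, now=360, rng=300, step=60, window=60)
latest = rate_at(errors, 360, 60)
print(peak, latest)
```

In this data the current rate is back to its baseline, while the maximum over the 5-minute range still reports the burst. That is the reason for wrapping rate in max_over_time.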

If we combine it with comparison operators, we can create a query that can serve as the basis for an alerting rule in the monitoring system.

max_over_time((rate(nginx_http_requests_total{status="500"}[1m]) > bool 0.6)[5m:1m])

This query returns only two values, 1 or 0: 1 if the error rate exceeded 0.6 requests per second at any step within the last 5 minutes (critical), and 0 otherwise (non-critical).
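The effect of the bool modifier followed by max_over_time can be illustrated with a short Python sketch. The rate values below are hypothetical, and gt_bool is a made-up helper that only imitates the "> bool" comparison on a list of already computed rates.

```python
def gt_bool(values, threshold):
    """Imitate PromQL's '> bool': map every value to 1 if it exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical per-second error rates, one per 1-minute step of a 5-minute range.
rates_with_spike = [0.1, 2.0, 0.1, 0.1, 0.1]
rates_calm = [0.1, 0.2, 0.1, 0.3, 0.1]

# The maximum of the 0/1 series is 1 as soon as any single step crossed the threshold.
alert_spike = max(gt_bool(rates_with_spike, 0.6))
alert_calm = max(gt_bool(rates_calm, 0.6))
print(alert_spike, alert_calm)
```

Without the bool modifier the comparison would filter out the steps below the threshold instead of turning them into 0, so the query would return no value at all in the calm case.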

Of course, different types of metrics need different time ranges and different comparison thresholds, but since Prometheus has subqueries built in, it is much easier to monitor certain metrics (such as errors in http_requests_total).


Original author: Mikhail, DevOps Engineer, cloudinfrastack