Working With CPU Metrics From Node Exporter
Run stress -c 5 on your server before starting this lesson.
With the Node Exporter up and running, we now have access to a number of infrastructure metrics in Prometheus, including data about our CPU. The processing power of our server affects how well nearly everything on it runs, so keeping track of how the CPU spends its time can be invaluable for diagnosing problems and reviewing trends in how our applications and services are running.
For almost all monitoring solutions on Linux, including Prometheus, data for this metric is pulled from the /proc/stat file on the host itself. In Prometheus, these metrics have names that start with node_cpu. Assuming we’re not running any guests on our host, the core metric we want to review is node_cpu_seconds_total.
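To make the link between /proc/stat and the per-mode seconds more concrete, here is a minimal Python sketch that parses the per-core cpu lines into seconds per mode. It is an illustration rather than the Node Exporter's actual code: the parse_proc_stat function, the SAMPLE text, and the default of 100 clock ticks per second are all assumptions made for this example (on a real Linux host, the tick rate can be read with os.sysconf("SC_CLK_TCK")).

```python
# Field order of the "cpu" lines in /proc/stat, as documented in proc(5).
MODES = ("user", "nice", "system", "idle", "iowait",
         "irq", "softirq", "steal", "guest", "guest_nice")


def parse_proc_stat(text, ticks_per_second=100):
    """Return {core_name: {mode: seconds}} for the per-core lines of /proc/stat.

    ticks_per_second=100 is an assumed default for this sketch.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only per-core lines such as "cpu0"; skip the aggregate
        # "cpu" line and unrelated lines such as "intr".
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        ticks = [int(value) for value in fields[1:]]
        result[fields[0]] = {
            mode: tick / ticks_per_second for mode, tick in zip(MODES, ticks)
        }
    return result


# Made-up sample in the /proc/stat layout, for illustration only.
SAMPLE = """cpu  400 0 200 1400 0 0 0 0 0 0
cpu0 100 0 50 850 0 0 0 0 0 0
cpu1 300 0 150 550 0 0 0 0 0 0
intr 12345
"""

for core, modes in parse_proc_stat(SAMPLE).items():
    print(core, "user:", modes["user"], "idle:", modes["idle"])
```

To try it against a real host, pass the contents of /proc/stat instead of SAMPLE, for example parse_proc_stat(open("/proc/stat").read()).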
node_cpu_seconds_total works as a counter: it keeps track of how long each CPU core spends in each mode, in seconds, and adds that time to a persistent, ever-increasing count. Counters might not seem especially helpful on their own, but combined with a little math, we can get a lot of information out of them.
Most of the time, what is helpful here is viewing the percentage of time our CPU spends in either the idle mode or any of the working modes. In Prometheus, we can do this with the rate and irate functions, which calculate the per-second rate of increase of a time series over a given range. rate averages over the whole range, while irate uses only the last two samples in the range, which makes it better suited to graphing fast-moving counters (like our CPU). Both should be used with counter-based metrics only.
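The difference between the two functions is easier to see with numbers. The following Python sketch is a simplified model, not Prometheus's implementation: the real rate also handles counter resets and extrapolates to the edges of the range, which is left out here. The simple_rate and simple_irate helpers and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# Hypothetical (timestamp, counter value) pairs for one core in user mode,
# scraped every 15 seconds. The counter jumps in the final interval.
samples = [(0, 100.0), (15, 101.5), (30, 103.0), (45, 104.5), (60, 118.0)]

# Multiplying by 100 turns "CPU seconds per second" into a percentage.
print("rate over the window: %.1f%%" % (simple_rate(samples) * 100))
print("irate (last two samples): %.1f%%" % (simple_irate(samples) * 100))
```

With these made-up samples, the averaged rate comes out at roughly 30%, while the instant rate reports roughly 90%, because only the instant rate is dominated by the final spike.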
We can see what percentage of time each core of our server spends in each mode by running irate(node_cpu_seconds_total[30s]) * 100 in the expression editor with a suggested limit of 30m, assuming you’re using a cloud playground server.
Additionally, we can check for things like the percentage of time the CPU spends running userland processes:
irate(node_cpu_seconds_total{mode="user"}[1m]) * 100
Or we can average across all the cores of each server in our fleet with the avg aggregation operator:
avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
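As a rough model of what avg by (instance) does with the per-core results, the Python sketch below groups values by their instance label and averages each group. The avg_by_instance helper, the instance names, and the idle fractions are all invented for illustration.

```python
from collections import defaultdict


def avg_by_instance(series):
    """Average values that share the same "instance" label."""
    grouped = defaultdict(list)
    for labels, value in series:
        grouped[labels["instance"]].append(value)
    return {
        instance: sum(values) / len(values)
        for instance, values in grouped.items()
    }


# Hypothetical per-core idle fractions, as an irate(...) query might return.
series = [
    ({"instance": "web-1:9100", "cpu": "0"}, 0.90),
    ({"instance": "web-1:9100", "cpu": "1"}, 0.70),
    ({"instance": "db-1:9100", "cpu": "0"}, 0.25),
    ({"instance": "db-1:9100", "cpu": "1"}, 0.75),
]

for instance, idle in avg_by_instance(series).items():
    print("%s idle: %.1f%%" % (instance, idle * 100))
```

The cpu label disappears from the result, leaving a single idle percentage per server.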
Other metrics to consider include node_cpu_guest_seconds_total, which works similarly to node_cpu_seconds_total but is especially useful for any machine running guest virtual machines.