
Monitoring

Last updated: 19 Apr 2021

Grafana

Grafana is an open-source visualisation tool. It does not store data, but consumes data sources to create real-time graphs displayed on custom dashboards. Data sources include Prometheus, Graphite, Logit and CloudWatch. The query language of the data store, such as PromQL for Prometheus, is used to construct the graphs.

Grafana dashboards

Useful Grafana dashboards:

The full list of Grafana dashboards is stored in the Puppet repo. For details on how to create a new dashboard, read the Grafana dashboards alert documentation.

Grafana tips

You can use regexes to filter for relevant information. For example, enter *frontend* on the processes dashboard to see all processes that have ‘frontend’ in their name.

We often show multiple metrics on the same graph. The position of the key shows which Y-axis each metric corresponds to:

screenshot of multiple metrics selected

You can click on a metric in a graph to show only that metric, or you can CMD + click to select multiple:

screenshot of multiple metrics selected

Annotations on charts show events such as deploys:

screenshot of annotations

For more tips, see the Introduction to Grafana slides.

Fixing N/A in dashboards

When a request for data times out, Grafana will render an “N/A” in the panel. Usually refreshing the page or choosing a shorter time range fixes the issue.

If a dashboard consistently returns “N/A”, then there may be an underlying issue.

In the failing panel, open Query Inspector, and read the error message for clues:

screenshot of query inspector

If you see the following error:

raise CorruptWhisperFile("Unable to read header", fh.name)
CorruptWhisperFile: Unable to read header (/opt/graphite/storage/whisper/stats/govuk/app/collections-publisher/ip-10-1-5-36/errors_occurred.wsp)

…that suggests the disk was full at the time of writing to Graphite. The solution is to remove the corrupt file, and ensure there is space on the disk.

SSH into the relevant machine and run more errors_occurred.wsp to see the file contents, or run ls -lsa in the file's directory to see the file sizes. The corrupt file should have a size of zero.
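The size check can also be scripted rather than read off a directory listing. The following is a minimal sketch, assuming a POSIX shell on the Graphite machine. The function name is_empty_file is hypothetical and is not part of Graphite or Whisper. The path in the usage comment is the one from the error message above.

```shell
# Hypothetical helper: succeeds (exit status 0) only if the path is a
# regular file that exists and has a size of zero bytes.
is_empty_file() {
  [ -f "$1" ] && [ ! -s "$1" ]
}

# Example usage, with the path taken from the error message:
# is_empty_file /opt/graphite/storage/whisper/stats/govuk/app/collections-publisher/ip-10-1-5-36/errors_occurred.wsp && echo "file is empty"
```

A non-zero exit status means the file is missing or still contains data, in which case the "N/A" may have a different cause.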

Delete all empty (corrupt) WSP files with:

sudo find /opt/graphite/storage/whisper/ -type f -empty -delete
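Because -delete is irreversible, it can be worth previewing what the command would remove and checking how full the disk is. The following is a rough sketch, assuming a find that supports -empty (as the command above already does) and a df that supports the POSIX -P flag. The function names list_empty_files and disk_use_percent are hypothetical helpers introduced for illustration.

```shell
# Hypothetical helper: print every empty regular file under a directory
# without deleting anything (the same match as above, with -print in place
# of -delete).
list_empty_files() {
  find "$1" -type f -empty -print
}

# Hypothetical helper: print the percentage of space used on the filesystem
# that holds the given path, for example "87%".
disk_use_percent() {
  df -P "$1" | awk 'NR==2 { print $5 }'
}

# Example usage on the Graphite machine (sudo may be needed to read the tree):
# list_empty_files /opt/graphite/storage/whisper/
# disk_use_percent /opt/graphite/storage/whisper/
```

If the preview lists only the files you expect, run the delete command above. A usage figure close to 100% suggests the disk still needs space freed, otherwise new Whisper files may be corrupted in the same way.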

You should now find the dashboard panels load properly.