Last updated: 25 Sep 2025

Prometheus, Grafana and Alertmanager

How we use Prometheus on GOV.UK

The Prometheus Operator installs and configures Prometheus, Alertmanager and Grafana through Custom Resource Definitions.

Prometheus is a systems monitoring and alerting toolkit. Its features include:

  • A time-series database, specifically optimised for graphing data over time.
  • A DSL for querying the data, called PromQL.
  • An HTTP pull model, in which Prometheus scrapes data from an HTTP endpoint that each application exposes through Rack middleware. Many GOV.UK applications use this functionality via the govuk-app-config gem, for example Finder Frontend’s monitoring of search metrics.
  • A push gateway model, which can be used when an individual job (e.g. a cronjob) creates data that requires graphing or monitoring: Example.
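As a rough illustration of what a scraped endpoint returns, the sketch below renders one metric in the Prometheus text exposition format. A job pushing to a Pushgateway would send the same text format in its request body. The metric name, labels and values are hypothetical, and this is not how the govuk-app-config gem is implemented.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Label values are
    written as-is, so this sketch assumes they contain no quotes,
    backslashes or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric name and labels, for illustration only.
body = render_metric(
    "search_requests_total",
    "Number of search requests served.",
    "counter",
    [({"result": "hit"}, 42), ({"result": "miss"}, 7)],
)
print(body)
```

Each sample line pairs a metric name and its labels with a value. Prometheus records the timestamp itself when it scrapes the endpoint.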

It is possible to query the Prometheus database via its UI using PromQL, but some people find it easier to use Grafana.
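The same PromQL queries can also be sent to the Prometheus HTTP API, which serves instant queries at /api/v1/query. The sketch below builds such a URL and parses a response of the documented shape. The base URL, the job name and the sample response are assumptions for illustration, not real GOV.UK values.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL; substitute the address of your Prometheus instance.
PROMETHEUS_URL = "https://prometheus.example.com"


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{query_string}"


def parse_instant_vector(response_text):
    """Return (labels, value) pairs from an instant-vector query response."""
    payload = json.loads(response_text)
    if payload["status"] != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    return [
        (series["metric"], float(series["value"][1]))
        for series in payload["data"]["result"]
    ]


# The job name here is an assumption, not a confirmed label value.
url = instant_query_url(PROMETHEUS_URL, 'up{job="finder-frontend"}')

# A hand-written response in the shape the API documents, not real data.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "finder-frontend"},
             "value": [1700000000, "1"]}
        ],
    },
})

print(url)
print(parse_instant_vector(sample))
```

Sample values arrive as strings in the JSON response, which is why the parser converts them to floats.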

Grafana

Grafana provides a frontend for visualising the Prometheus data. A common workflow when exploring the data or creating a new dashboard would be:

  1. Visit https://grafana.eks.production.govuk.digital/dashboards
  2. Create a new dashboard, and save it in the Experiments folder
  3. Click “Add visualization”, and select “Prometheus” as the data source
  4. Enter your PromQL query into the Metrics browser field (see Grafana tips for pointers).
  5. If you would like to create a permanent dashboard, select “Export” and then “Export as JSON”.
  6. Copy the output, and open a PR in govuk-helm-charts (see prior art).

Grafana tips

  1. When creating a new visualisation, select “Builder” instead of “Code”. Builder provides a nicer UI with drop-down menus, so you can see, for example, the names of all the labels that are available.
  2. The label filter “job” refers to the name of the application.
  3. The “Explain” toggle is very helpful: it gives recommendations and explains the syntax.
  4. If you open an existing dashboard and find a panel that is similar to your use case, you can obtain its PromQL expression to use as a starting point by selecting the three dots in the top right-hand corner, and then “Explore”.
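To show the kind of expression that a "job" label filter produces, here is a minimal sketch that assembles a PromQL selector and wraps it in rate(). The metric name, label values and helper names are hypothetical. Only the selector syntax and the rate() function come from PromQL itself.

```python
def selector(metric, **labels):
    """Build a PromQL selector with exact-match label filters."""
    if not labels:
        return metric

    def escape(value):
        # Escape backslashes and double quotes inside label values.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    matchers = ",".join(
        f'{key}="{escape(value)}"' for key, value in sorted(labels.items())
    )
    return f"{metric}{{{matchers}}}"


def per_second_rate(selector_text, window="5m"):
    """Wrap a counter selector in rate() over the given time window."""
    return f"rate({selector_text}[{window}])"


# Hypothetical metric; "job" is assumed to hold the application name.
expr = per_second_rate(
    selector("http_requests_total", job="finder-frontend", status="500")
)
print(expr)
```

Pasting an expression like this into the “Code” view should give the same result as choosing the equivalent metric and label filters in “Builder”.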

Alertmanager

Prometheus evaluates alerting rules and passes any firing alerts to Alertmanager, which sends notifications when metrics cross predefined thresholds. These thresholds are defined as rules in govuk-helm-charts. See the Prometheus Operator manual on how to write these rules.

It is also possible to see all of the configured alerts via the Prometheus UI itself, by clicking the Alerts tab.

Somewhat confusingly, the Alertmanager UI does not show all configured alerts, only those which are currently firing.

Alert severity levels

As part of the rule definition, each alert is labelled with one of the following severity levels (in decreasing order of urgency), to indicate the impact and expected response time:

  • Page: Used for urgent, severe conditions that require an immediate response, even out-of-hours. These alerts will be forwarded to PagerDuty. Examples include conditions that make the service unusable or imminently unusable, e.g. a database that is out of disk space.
  • Critical: Used for conditions that should be addressed promptly within usual business hours, but do not require immediate out-of-hours attention. Examples include conditions that are likely to cause an outage soon, e.g. disk usage at 90% with less than a day of capacity left.
  • Warning: Used as an early indicator of abnormal conditions that might require investigation during usual business hours if they persist, e.g. disk usage at 70%.

Alerts for purely informational or debugging conditions should be avoided, to reduce noise and alert fatigue.
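As a sketch of how a severity label attaches to a rule, the snippet below builds one alerting rule using the standard Prometheus rule fields (alert, expr, for, labels, annotations) and rejects any severity outside the three levels above. The lowercase label values, the metric name and the helper function are assumptions. Check govuk-helm-charts for the conventions actually in use.

```python
import json

# Assumed label values: the levels are named above, but their exact
# spelling in rule definitions is not confirmed here.
SEVERITIES = ("page", "critical", "warning")


def alert_rule(name, expr, severity, for_duration, summary):
    """Return one alerting rule in the shape used by Prometheus rule files."""
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}, got {severity!r}")
    return {
        "alert": name,
        "expr": expr,
        "for": for_duration,
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


rule = alert_rule(
    name="DiskUsageHigh",
    expr="disk_used_ratio > 0.9",  # hypothetical metric name
    severity="critical",
    for_duration="10m",
    summary="Disk usage is above 90%",
)
print(json.dumps(rule, indent=2))
```

Rejecting unknown severities at this point is one way to keep informational alerts from being added unnoticed.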