Uptime metrics are collected for
whitehall-admin and are available as a Grafana dashboard.
The dashboard breaks the metrics down into a day-by-day view, with each day coloured according to its level of uptime: green means 100%, orange means above 99.31% (equivalent to 10 minutes of downtime in a day), and red means anything lower.
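The thresholds above can be checked with a little arithmetic. The following is a minimal sketch, not the dashboard's actual configuration: the function names are hypothetical, and it assumes the 99.31% figure is calculated over a 24-hour day.

```python
MINUTES_PER_DAY = 24 * 60  # 1440 minutes; assumes uptime is measured per day


def uptime_percentage(downtime_minutes: float) -> float:
    """Convert minutes of downtime in one day into an uptime percentage."""
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_DAY)


def colour_for(uptime: float, orange_threshold: float = 99.31) -> str:
    """Map a daily uptime percentage to the colour bands described above."""
    if uptime >= 100.0:
        return "green"
    if uptime > orange_threshold:
        return "orange"
    return "red"
```

Ten minutes of downtime is 10 / 1440 of a day, which gives an uptime of roughly 99.31% when rounded to two decimal places.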
Note: these metrics aren’t a true reflection of availability. Load balancing means that even if a particular health check fails, and the metrics change as a result, publishing is unlikely to be affected.
These metrics are currently used in reports to the GOV.UK senior management team, to give them an understanding of the health of the platform.
The service which collects the uptime data runs on the monitoring machines and can be viewed in govuk-puppet. It works by polling a given endpoint, such as
/healthcheck, every 5 seconds and records that an application is up if it receives a 2xx HTTP status code back. It uses statsd to send this data to Graphite under the names
stats.gauges.uptime.<application>, which is then used in the Grafana dashboard.
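The polling behaviour described above can be sketched as follows. This is an illustration only, not the collector defined in govuk-puppet: the function names, the statsd address (localhost:8125) and the choice of sending 1 for up and 0 for down are all assumptions. It relies on statsd prefixing gauge names with stats.gauges by default, so a gauge sent as uptime.<application> would appear in Graphite under stats.gauges.uptime.<application>.

```python
import http.client
import socket
import time
import urllib.error
import urllib.request


def is_up(status_code: int) -> bool:
    """An application counts as up only when it returns a 2xx status code."""
    return 200 <= status_code < 300


def gauge_line(application: str, up: bool) -> str:
    """Build a statsd gauge in the wire format <name>:<value>|g.

    Sending 1 for up and 0 for down is an assumption made for this sketch.
    """
    return f"uptime.{application}:{1 if up else 0}|g"


def check(url: str, timeout: float = 5.0) -> bool:
    """Request the endpoint once and report whether it looked healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_up(response.status)
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # urlopen raises HTTPError (a URLError) for 4xx and 5xx responses;
        # connection failures and timeouts are also treated as "down".
        return False


def run(application: str, url: str,
        statsd_addr: tuple = ("localhost", 8125),  # hypothetical statsd address
        interval: float = 5.0) -> None:
    """Poll the endpoint every `interval` seconds and send a gauge over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        line = gauge_line(application, check(url))
        sock.sendto(line.encode("ascii"), statsd_addr)
        time.sleep(interval)
```

Calling run("whitehall-admin", url) with the application's health check URL would loop indefinitely, sending one gauge every 5 seconds.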
If you would like to add another app to the uptime collector, you should first make sure there is a
/healthcheck endpoint available and then add your application to the end of the line in the service file.
Note: if there isn’t an exposed
/healthcheck endpoint available, an alternative can be given by using the format