Uptime metrics are collected for whitehall-admin and are available as a Grafana dashboard.
The metrics are broken down into a day-by-day view, with each day coloured according to its level of uptime: green means 100%, orange means above 99.31% (equivalent to 10 minutes of downtime in a day), and red means anything lower.
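The colour thresholds can be illustrated with a short sketch. It is not the dashboard's actual configuration: the function names `uptime_percentage` and `colour_for` are hypothetical, and the only values taken from the text are the 100% and 99.31% cut-offs.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def uptime_percentage(downtime_minutes: float) -> float:
    """Return one day's uptime as a percentage, given minutes of downtime."""
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_DAY)


def colour_for(uptime: float) -> str:
    """Map a daily uptime percentage to the colour described above."""
    if uptime >= 100.0:
        return "green"
    if uptime > 99.31:
        return "orange"
    return "red"


# 10 minutes of downtime in a day is where the 99.31% figure comes from.
print(round(uptime_percentage(10), 2))  # 99.31
print(colour_for(uptime_percentage(5)))   # orange
print(colour_for(uptime_percentage(30)))  # red
```

Note that 10 minutes of downtime is 99.3056% before rounding, so how a day with exactly 10 minutes of downtime is coloured depends on how the dashboard rounds. The sketch does not try to settle that.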
Note: These metrics aren’t a true reflection of availability. Load balancing means that even if a particular healthcheck fails and the metrics change as a result, publishing is unlikely to be affected.
The service that collects the uptime data runs on the monitoring machines and can be viewed in govuk-puppet. It works by polling a given endpoint, such as /healthcheck, every 5 seconds, and records an application as up if it receives a 2xx HTTP status code back. It uses statsd to send this data to Graphite under stats.gauges.uptime.<application>, which is then used in the Grafana dashboard.
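The polling loop described above might look roughly like the following Python sketch. It is an illustration only, not the collector that lives in govuk-puppet. The assumptions are: the function names are hypothetical, the gauge is sent as `1` for up and `0` for down, statsd listens on UDP `localhost:8125`, and statsd adds its own gauge prefix when flushing to Graphite.

```python
import socket
import time
import urllib.error
import urllib.request

POLL_INTERVAL_SECONDS = 5  # polling interval taken from the text


def is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with a 2xx status code.

    urlopen follows redirects and raises HTTPError for 4xx/5xx responses,
    so any error or timeout is treated as "down".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False


def gauge_line(application: str, up: bool) -> str:
    """Build a statsd gauge packet; the 1/0 encoding is an assumption."""
    return f"uptime.{application}:{1 if up else 0}|g"


def send_gauge(line: str, host: str = "localhost", port: int = 8125) -> None:
    """Send one packet to statsd over UDP (host and port are assumed)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


def poll_forever(application: str, url: str) -> None:
    """Check the endpoint every POLL_INTERVAL_SECONDS and report the result."""
    while True:
        send_gauge(gauge_line(application, is_up(url)))
        time.sleep(POLL_INTERVAL_SECONDS)
```

A call such as `poll_forever("whitehall-admin", "https://example.invalid/healthcheck")` would run the loop; the URL here is a placeholder, not a real endpoint.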
If you would like to add another app to the uptime collector, you should first
make sure there is a
/healthcheck endpoint available and then add your
application to the end of the line in the service file.
Note: If there is no exposed /healthcheck endpoint available, an alternative can be given by using the format