Skip to main content
Warning This document has not been updated for a while now. It may be out of date.
Last updated: 5 Apr 2022

Fastly error rate for GOV.UK

We get response code reporting from Fastly (with a 15 minute delay). It averages out the last 15 minutes worth of 5xx errors. This is a useful supplementary metric to highlight low-level errors that occur over a longer period of time.

The alert appears on monitoring-1.management.

It is possible to query the CDN logs using AWS Athena. The following query gets a count of URLs where a 5xx error has been served between the given timestamps:

SELECT url, status, COUNT(*) AS count
FROM fastly_logs.govuk_www
WHERE status >= 500 AND status <= 599
AND request_received >= TIMESTAMP '2018-11-26 11:00'
AND request_received < TIMESTAMP '2018-11-26 11:30'
AND date = 26 AND month = 11 AND year = 2018
GROUP BY url, status
ORDER BY count DESC

It is also possible to examine the raw Fastly CDN logs:

  • ssh monitoring-1.management.production
  • cd /var/log/cdn to access log files

Unknown alert

The alert appears on monitoring-1.management. Collectd uses the Fastly API to get statistics which it pushes to Graphite. If the alert is unknown, collectd likely cannot talk to Fastly so restart collectd.

$ sudo service collectd restart

To prove collectd is the problem, use this query in Kibana:

syslog_hostname:monitoring-1.management AND syslog_program:collectd

You will see many reports similar to:

cdn_fastly plugin: Failed to query service