This page describes what to do in case of an Icinga alert. For more information, search the govuk-puppet repo for the source of the alert.

Fastly error rate for GOV.UK

Error rate alert

We get response code reporting from Fastly (with a 15-minute delay). The check averages the 5xx errors over the last 15 minutes of data. This is a useful supplementary metric for highlighting low-level errors that occur over a longer period of time.

The alert appears on monitoring-1.management. A good starting point for investigation is to examine the Fastly CDN logs.

  • ssh logs-cdn-1.management.production
  • cd /mnt/logs_cdn to access log files
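Once in the log directory, a quick per-status tally can show which 5xx codes dominate. The following is a minimal sketch, not an established tool: `count_5xx` is a hypothetical helper, and it assumes the log lines are whitespace-separated with the HTTP status code in a field whose number you pass in. Check the real layout of the CDN log files before relying on it.

```shell
# Hypothetical helper: tally 5xx responses by status code.
# Assumption: log lines are whitespace-separated and the HTTP status
# sits in field number $1. The real Fastly log layout may differ.
count_5xx() {
  field="${1:?usage: count_5xx FIELD_NUMBER < logfile}"
  # Count each distinct 5xx status found in the chosen field,
  # then list the most frequent status first.
  awk -v f="$field" '$f ~ /^5[0-9][0-9]$/ { n[$f]++ }
                     END { for (s in n) print n[s], s }' | sort -rn
}
```

For example, if the status code turned out to be the ninth field, you could run `count_5xx 9 < some-log-file`, where both the field number and the file name are placeholders.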

Alternatively, you can search Kibana with the query application:"govuk-cdn-logs-monitor"

Unknown alert

The alert appears on monitoring-1.management. Collectd uses the Fastly API to get statistics, which it pushes to Graphite. If the alert is unknown, collectd likely cannot talk to Fastly, so restart collectd:

  • sudo service collectd restart

To confirm that collectd is the problem, use this query in Kibana:

syslog_hostname:monitoring-1.management AND syslog_program:collectd

You will see many reports similar to:

cdn_fastly plugin: Failed to query service
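If Kibana is unavailable, the same message can be counted from syslog output on the monitoring machine. This is a sketch under assumptions: `count_fastly_failures` is a hypothetical helper, and where collectd's syslog output ends up on the machine is not stated on this page, so the input file is left for you to supply.

```shell
# Hypothetical helper: count collectd failures to query Fastly.
# Reads syslog lines on stdin and prints the number of lines that
# contain the failure message quoted above.
count_fastly_failures() {
  # -F matches the message as a fixed string; -c prints the match count.
  grep -F -c 'cdn_fastly plugin: Failed to query service'
}
```

Running it before and after the restart, for example `count_fastly_failures < path-to-syslog-file` (placeholder path), gives a rough indication of whether new failures are still being logged.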

More about Icinga alerts

This page is owned by #2ndline and needs to be reviewed