This page describes what to do in case of an Icinga alert. For more information, search the govuk-puppet repo for the source of the alert.

This page was imported from the opsmanual on github.gds. It hasn’t been reviewed for accuracy yet.

Fastly error rate for GOV.UK

Error rate alert

We get response code reporting from Fastly (with a 15 minute delay). The check averages the 5xx errors over the last 15 minutes' worth of data. This makes it a useful supplementary metric for highlighting low-level errors that persist over a longer period of time.

The alert appears on monitoring-1.management. A good starting point for investigation is to examine the Fastly CDN logs.

  • ssh logs-cdn-1.management.production
  • cd /mnt/logs_cdn to access log files
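
Once in the log directory, a quick tally of 5xx responses per status code can show which error dominates. The following is a minimal sketch, not a documented procedure: `count_5xx` is a hypothetical helper, and it assumes the log lines are whitespace-separated with the HTTP status in a known field. The default of field 9 matches the Apache combined format only; the real Fastly log format is not described on this page, so inspect a log line first and pass the correct field number.

```shell
# Hypothetical helper: count 5xx responses per status code in a log file.
# ASSUMPTION: fields are whitespace-separated and the HTTP status code is
# in field number $2 (default 9, as in Apache combined format). The actual
# Fastly log layout may differ, so check a sample line before trusting this.
count_5xx() {
    log_file="$1"
    status_field="${2:-9}"
    awk -v f="$status_field" '
        $f ~ /^5[0-9][0-9]$/ { count[$f]++ }
        END { for (code in count) print code, count[code] }
    ' "$log_file" | sort
}
```

Call it as `count_5xx "$some_log_file" 9`, where `$some_log_file` is whichever file under /mnt/logs_cdn covers the period of the alert. Each output line is a status code followed by the number of matching lines.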

Unknown alert

The alert appears on monitoring-1.management. Collectd uses the Fastly API to get statistics, which it pushes to Graphite. If the alert is in an unknown state, collectd most likely cannot talk to Fastly, so restart collectd.

  • sudo service collectd restart

To confirm that collectd is the problem, use this query in Kibana:

@source_host:monitoring-1.management AND @fields.syslog_program:collectd

You will see many reports similar to:

cdn_fastly plugin: Failed to query service
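
If Kibana is unavailable, the same message can be counted directly in a log file on the machine. This is a rough sketch under assumptions: `count_fastly_failures` is a hypothetical helper, and the location of the syslog file that collectd writes to is not stated on this page, so the path has to be supplied by whoever runs it.

```shell
# Hypothetical helper: count cdn_fastly query failures in a syslog-style file.
# ASSUMPTION: collectd's messages end up in a plain-text log file whose path
# is passed as $1; this page does not say where that file lives.
count_fastly_failures() {
    # grep -c prints 0 and exits 1 when nothing matches; treat that as success.
    grep -c -F 'cdn_fastly plugin: Failed to query service' "$1" || [ $? -eq 1 ]
}
```

Running it before and a few minutes after the restart gives a simple check: if the count keeps growing, collectd still cannot reach Fastly.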

This page is owned by #2ndline and needs to be reviewed