Most teams in GOV.UK have screens set up to show data about pull requests, releases and/or current work.
The search screen displays live data from GOV.UK. It includes the number of people on GOV.UK, latest searches, trending and recent content. It’s not publicly accessible because there’s sometimes personal data in the latest searches.
2nd line screens
There are two screens by the 2nd line desks.
The top screen is a webpage split into two frames: production health and the Icinga alert summary per environment.
The bottom screen is a PaaS-hosted Grafana dashboard showing statistics for data.gov.uk.
This dashboard contains two graphs: one of origin 4xx and 5xx errors, and one of edge 4xx and 5xx errors. It’s worth keeping an eye on these graphs and looking for anomalies, as they may indicate issues on production. Because of our caching behaviour, the top graph of origin errors is likely to show issues before they become visible in the second graph and to end users.
Sometimes the ‘EDGE’ graphs may disappear. The data for these graphs is collected by the collectd-cdn plugin on monitoring-1.management.production. If the graphs disappear, the plugin should write errors to /var/log/syslog. They may look something like:
Nov 10 11:37:17 monitoring-1.management collectd: cdn_fastly plugin: Failed to query service: govuk
Nov 10 11:37:17 monitoring-1.management collectd: cdn_fastly plugin: Failed to query service: tldredirect
Nov 10 11:37:17 monitoring-1.management collectd: cdn_fastly plugin: Failed to query service: assets
Nov 10 11:37:17 monitoring-1.management collectd: cdn_fastly plugin: Failed to query service: redirector
If this happens, restarting collectd on the monitoring server may kick things into life:
sudo service collectd restart
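Before restarting, it can help to confirm which services the plugin failed to query. The following is a minimal sketch, assuming the log lines follow the format shown above. The function name `failed_cdn_services` is hypothetical and not part of any GOV.UK tooling.

```shell
# Hypothetical helper: list each service the cdn_fastly plugin failed to query.
# Takes an optional log file path; defaults to /var/log/syslog.
failed_cdn_services() {
  # Keep only the plugin's failure lines, strip everything up to the
  # service name, then de-duplicate the names.
  grep 'cdn_fastly plugin: Failed to query service:' "${1:-/var/log/syslog}" \
    | sed 's/.*Failed to query service: //' \
    | sort -u
}
```

Running `failed_cdn_services` on the monitoring server prints one service name per line. If it prints nothing, the missing graphs may have a different cause. Reading /var/log/syslog may require sudo, depending on how the machine is configured.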
Icinga alert summary per environment
This screen shows a summary of the critical and warning alerts for our four environments (production, staging, integration, CI) plus upcoming AWS environments. Each summary appears in a colour-coded box: red for critical errors, yellow for warnings, purple for unknown errors and green for no issues.
More about Monitoring
- Add a deployment dashboard for an application
- Add an Icinga passive check to a Jenkins job
- Add sidekiq-monitoring to your application
- Error reporting with Sentry
- GOV.UK and Virtual Private Networks (VPNs)
- Graphite and deployment dashboards
- How to deal with errors
- Monitor Sidekiq queues for your application
- Nagios NRPE connection failures
- Pingdom Bouncer canary check
- Tools: Icinga, Grafana and Graphite, Kibana and Fabric
- Uptime Metrics
- Use AWS X-Ray to trace app requests
- Use Terraboard to monitor Terraform state