How to deal with errors
Sometimes applications will encounter exceptions. This policy describes what we should do for different types of errors. It was first proposed in RFC 87.
1. When something goes wrong, we should be notified
Applications should report exceptions to Sentry. Applications must not swallow errors.
2. Notifications should be actionable
Sentry notifications should require a developer of the app to take action. They should not be just a piece of information.
3. Applications should not error
The goal of GOV.UK is that applications should not error. When something goes wrong it should be fixed.
A code change makes the application crash.
Desired behaviour: error is sent to Sentry, developers are notified and fix the error. Developers mark the error in Sentry as Resolved. This means a recurrence of the error will alert developers again.
Intermittent errors without user impact
Frontend applications often see timeouts when talking to the content-store.
There’s little or no user impact because the request will be answered by the caching layer.
Desired behaviour: error is not sent to Sentry. Instead, we rely on Smokey and Icinga checks to make sure the site functions.
Intermittent errors with user impact
Publishing applications sometimes see timeouts when talking to publishing-api. This results in the publisher seeing an error page and possibly losing data.
Desired behaviour: apps handle these errors better, for example by offloading the work to a Sidekiq worker. Since these errors aren’t actionable, they should not be reported to Sentry. They should be tracked in Graphite.
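One way to track such errors without reporting them is to count them as a metric. The sketch below is illustrative only: `FakeStats` is a hypothetical in-memory stand-in for a StatsD or Graphite client, and `track_timeouts`, the metric name and the fallback value are assumptions rather than part of any GOV.UK library.

```ruby
require "timeout"

# Hypothetical in-memory stand-in for a StatsD/Graphite client.
class FakeStats
  attr_reader :counters

  def initialize
    @counters = Hash.new(0)
  end

  def increment(key)
    @counters[key] += 1
  end
end

# Runs the block. On a timeout, counts the failure under the given
# metric and returns the fallback instead of reporting the error.
def track_timeouts(stats, metric:, fallback: nil)
  yield
rescue Timeout::Error
  stats.increment(metric)
  fallback
end

stats = FakeStats.new
result = track_timeouts(stats, metric: "publishing_api.timeouts", fallback: :queued_for_retry) do
  raise Timeout::Error, "publishing-api timed out"
end
```

In a real application the fallback branch would be where the work is handed to a background worker, so the publisher does not lose data.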
Intermittent retryable errors
A Sidekiq worker sends something to the publishing-api, which times out. Sidekiq retries, and the next attempt succeeds.
Desired behaviour: errors are not reported to Sentry until retries are exhausted. See this PR for an example.
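The behaviour described here can be sketched in plain Ruby. This is a minimal illustration, not the code from the PR mentioned above: `RecordingReporter`, `TransientError` and `with_retries` are hypothetical names, and the reporter stands in for an error tracking client. In a real Sidekiq worker, the `sidekiq_retries_exhausted` hook is the usual place for the final report.

```ruby
# Hypothetical reporter standing in for the error tracking client.
class RecordingReporter
  attr_reader :reported

  def initialize
    @reported = []
  end

  def notify(error)
    @reported << error
  end
end

# Hypothetical error class for failures that are worth retrying.
class TransientError < StandardError; end

# Calls the block up to max_attempts times. Transient failures are
# retried silently; only the final failure is reported and re-raised.
def with_retries(max_attempts:, reporter:)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue TransientError => e
    retry if attempts < max_attempts
    reporter.notify(e)
    raise
  end
end

reporter = RecordingReporter.new

# Fails on the first attempt and succeeds on the second,
# so nothing is reported.
with_retries(max_attempts: 3, reporter: reporter) do |attempt|
  raise TransientError, "publishing-api timed out" if attempt < 2
  :ok
end
```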
Expected environment-based errors
MySQL errors on staging while data sync happens.
Desired behaviour: our environment is set up such that these errors do not occur.
Bad request errors
User makes a request the application can’t handle (example).
Often happens in security checks.
Desired behaviour: the user gets feedback, and the error is not reported to Sentry.
Incorrect bubbling up of errors
Rummager crashes on date parsing and returns a 422, which raises an error in finder-frontend.
Desired behaviour: a 4XX response is returned to the browser, including an error message. Nothing is ever sent to Sentry.
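A rough sketch of this pattern follows. It assumes a hypothetical `UpstreamHTTPError` raised by an API client and a hypothetical `render_search` method that returns a status and body pair. Neither name comes from finder-frontend or Rummager.

```ruby
# Hypothetical error raised by an API client when the upstream
# service responds with a non-success status.
class UpstreamHTTPError < StandardError
  attr_reader :status

  def initialize(status, message = "upstream error")
    super(message)
    @status = status
  end
end

# Returns a [status, body] pair. Upstream 4XX responses are passed
# on to the browser as client errors with a message. Anything else
# is re-raised so that genuine failures still surface.
def render_search(params)
  [200, yield(params)]
rescue UpstreamHTTPError => e
  raise unless (400..499).cover?(e.status)

  [e.status, "Sorry, we could not process that request: #{e.message}"]
end

status, body = render_search({ "date" => "not-a-date" }) do |_params|
  raise UpstreamHTTPError.new(422, "invalid date")
end
```

Re-raising for statuses outside the 4XX range keeps upstream 5XX failures visible, since those are not caused by the user's request.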
Manually logged errors
Something goes wrong and we need to let developers know.
Example: Slimmer’s old behaviour
Desired behaviour: developers do not use Sentry for logging. The app either raises the actual error (which causes the user to see the error) or logs the error to Kibana.
IP spoof errors
Desired behaviour: HTTP 400 is returned, error is not reported to Sentry.
Database entry not found
Often a controller will do something like Thing.find(params[:id]) and rely on Rails to show a 404 page for the ActiveRecord::RecordNotFound it raises (context).
Desired behaviour: errors are not reported to Sentry
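One way to express this is a list of expected exception classes that is checked before anything is reported. The sketch below is an assumption about how such a filter could look, not the configuration GOV.UK uses: `RecordNotFound` is a local stand-in for ActiveRecord::RecordNotFound so the example runs without Rails, and `EXCLUDED_EXCEPTIONS` and `reportable?` are hypothetical names.

```ruby
# Stand-in for ActiveRecord::RecordNotFound so the sketch runs
# without Rails loaded.
class RecordNotFound < StandardError; end

# Hypothetical list of exception class names that are expected
# and should not be reported.
EXCLUDED_EXCEPTIONS = ["RecordNotFound"].freeze

# True unless the error's class, or one of its ancestors, is in
# the excluded list.
def reportable?(error)
  (error.class.ancestors.map(&:name) & EXCLUDED_EXCEPTIONS).empty?
end

reportable?(RecordNotFound.new)   # false: expected, Rails renders a 404
reportable?(RuntimeError.new)     # true: unexpected, should be reported
```

Matching on ancestors means subclasses of an excluded exception are excluded too.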