Sentry is a service that collates errors that happen on GOV.UK in one place, regardless of which application, what environment or what machine the error occurred on. It is far more convenient than logging on to individual machines and querying their logs.
To see error volumes at a glance, use:
- Sentry Organization Stats for an overview of how many errors are being sent/rate-limited site-wide.
- Sentry Projects for a per-project view, enabling you to see which ones are causing the most errors right now.
- Sentry Grafana dashboards: Production, Staging and Integration (all require VPN), for a per-environment breakdown of Sentry errors.
- Project: each GOV.UK application has its own project on Sentry, where errors originating from that application are consolidated. Example: publishing-api
- Issue: a group of similar errors. For example, multiple instances of NoMethodError from the same app are likely to be logged under the same issue (see example) in the project in Sentry. In practice, a variety of factors determine whether errors are grouped under the same issue.
- Error/Exception: a single occurrence of an Issue, containing details such as stack trace, arguments, server IP and error message. See example.
How Sentry is integrated on GOV.UK
Projects can be created and edited in the Sentry UI, but this risks creating inconsistencies or missing apps. We therefore configure projects using govuk-saas-config (and its associated rake tasks), which read a list of apps from govuk-developer-docs and make sure that all configuration is set up correctly.
Apps are configured to talk to Sentry using the govuk_app_config gem, which interfaces with Sentry via its GovukError class. Apps can customise this configuration by calling GovukError.configure - see example. This delegates to the sentry-raven gem under the hood, though that gem has since been superseded by sentry-ruby, which we plan to migrate to in the future.
Unhandled exceptions are automatically logged to Sentry, but you can also
manually report something to Sentry using
There are different ‘roles’ in Sentry, with different permissions:
- Most people with access to Sentry have the “Member” role, which lets them view errors, issues and projects.
- Those with the “Admin” role can configure teams and projects.
- Those with the “Manager” role can edit organisation settings and set things like global rate limits.
- Those with the “Owner” role have unrestricted access to the system and can make billing and plan changes.
Some links in this documentation will only be visible to those with the right permissions. If you need a higher permission level than you currently have, the senior tech team can grant it to you.
Deciding whether to log errors in Sentry
We should only capture actionable errors in Sentry, i.e. errors representing an underlying issue that developers can fix. For example, we should not capture 4XX responses, as these indicate a client-side issue. To help with this, several exception types are ignored by default - see below.
Ignoring exception types
govuk_app_config has a list of ignorable exceptions (excluded_exceptions). If your application raises an exception on this list, it will not be logged in Sentry.
Note that sentry-raven also has a default list of exceptions to ignore, but govuk_app_config overwrites this list rather than appending to it, so sentry-raven’s defaults have no effect on GOV.UK.
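The difference between overwriting and appending can be shown with a minimal plain-Ruby sketch. Both lists and every exception name below are hypothetical placeholders, not the real contents of sentry-raven’s defaults or of govuk_app_config’s excluded_exceptions.

```ruby
# Hypothetical placeholder lists; the real gems define their own contents.
LIBRARY_DEFAULT_EXCLUSIONS = ["HypotheticalRoutingError", "HypotheticalNotFoundError"].freeze
GOVUK_EXCLUSIONS = ["HypotheticalTimeoutError"].freeze

# Overwriting: the library defaults are discarded entirely.
OVERWRITTEN = GOVUK_EXCLUSIONS.dup.freeze

# Appending: the library defaults are kept and extended.
APPENDED = (LIBRARY_DEFAULT_EXCLUSIONS + GOVUK_EXCLUSIONS).freeze

# Returns true if the exception class name is on the given list.
def excluded?(exception_class_name, exclusions)
  exclusions.include?(exception_class_name)
end

puts excluded?("HypotheticalNotFoundError", OVERWRITTEN) # prints false
puts excluded?("HypotheticalNotFoundError", APPENDED)    # prints true
```

With overwriting, an exception that only appears in the library defaults is no longer excluded, which is why those defaults do not apply on GOV.UK.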
Then there is a list of exceptions to ignore if they occur during the nightly data sync. These are exceptions that are commonly raised while the databases and content store are periodically unavailable.
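A time-based exclusion like this has to cope with a nightly window that crosses midnight. The sketch below assumes, purely for illustration, that the window is a fixed 22:00 to 08:00 range and that the exception name is hypothetical; govuk_app_config’s real mechanism for detecting this period may differ.

```ruby
# Hypothetical nightly window (hours on a 24h clock) and exception name.
WINDOW_START_HOUR = 22
WINDOW_END_HOUR = 8
NIGHTLY_EXCLUSIONS = ["HypotheticalDatabaseUnavailableError"].freeze

# True if the given Time falls inside a window that may wrap past midnight.
def in_nightly_window?(time, start_hour: WINDOW_START_HOUR, end_hour: WINDOW_END_HOUR)
  hour = time.hour
  if start_hour <= end_hour
    hour >= start_hour && hour < end_hour
  else
    # The window wraps midnight, e.g. 22:00 -> 08:00.
    hour >= start_hour || hour < end_hour
  end
end

# An exception is ignored only if it is on the list AND occurs inside the window.
def ignore_during_nightly_window?(exception_class_name, time)
  in_nightly_window?(time) && NIGHTLY_EXCLUSIONS.include?(exception_class_name)
end
```

The same exception raised at midday would still be reported, because outside the window it is more likely to indicate a genuine problem.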
When determining whether or not to ignore an exception, the exception chain is
inspected. This means if an exception isn’t on the ignore list, but its
underlying cause is an exception on the ignore list, then it will be ignored.
This applies to
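Ruby records the exception chain through Exception#cause, which is set automatically when a new exception is raised inside a rescue block. The following is a minimal sketch of inspecting that chain against an ignore list; the exception classes and the list are hypothetical, and the real gems may match exceptions differently (this version compares class names only).

```ruby
# Hypothetical exception classes for illustration.
class HypotheticalDatabaseUnavailableError < StandardError; end
class HypotheticalPageRenderError < StandardError; end

IGNORED_EXCEPTIONS = ["HypotheticalDatabaseUnavailableError"].freeze

# Walks exception -> exception.cause -> ... and returns true if any link
# in the chain has a class name on the ignore list.
def ignored_in_chain?(exception, ignored = IGNORED_EXCEPTIONS)
  seen = {}
  current = exception
  while current && !seen.key?(current.object_id)
    return true if ignored.include?(current.class.name)
    seen[current.object_id] = true # defensive guard against revisiting a link
    current = current.cause
  end
  false
end

# Raising inside a rescue block sets the new exception's #cause automatically.
def wrapped_error
  begin
    raise HypotheticalDatabaseUnavailableError, "database unavailable"
  rescue HypotheticalDatabaseUnavailableError
    raise HypotheticalPageRenderError, "could not render page"
  end
rescue HypotheticalPageRenderError => e
  e
end

puts ignored_in_chain?(wrapped_error) # prints true
```

The outer HypotheticalPageRenderError is not on the list, but its cause is, so the whole error is ignored.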
There is a hardcoded list of environments in which Sentry is considered to be active. If the SENTRY_CURRENT_ENV environment variable available to your app is not on the active_sentry_environments list, then none of its errors will be logged to Sentry.
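The check amounts to a list lookup on an environment variable. This sketch uses a hypothetical list of environment names (the real active_sentry_environments list lives in govuk_app_config) and accepts the environment as a Hash so it can be exercised without touching ENV.

```ruby
# Hypothetical list; the real active_sentry_environments list lives in govuk_app_config.
ACTIVE_SENTRY_ENVIRONMENTS = ["example-integration", "example-staging", "example-production"].freeze

# Returns true only if SENTRY_CURRENT_ENV is set to one of the active environments.
# A missing variable yields nil, which is never on the list.
def sentry_active?(env = ENV)
  ACTIVE_SENTRY_ENVIRONMENTS.include?(env["SENTRY_CURRENT_ENV"])
end
```

For example, `sentry_active?({ "SENTRY_CURRENT_ENV" => "development" })` returns false, so a local development app sends nothing to Sentry.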
Advanced Sentry customisation
You can configure any of the lists above by appending to them.
You can also run arbitrary code to decide whether or not an error should be logged in Sentry, using the should_capture lambda. Your custom code will be lazily combined with the default evaluators such that if an exception is on any of the ‘excluded exceptions’ lists, it will be excluded (even if your custom should_capture callback returns true).
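The precedence described above can be modelled with a small stand-in class. HypotheticalSentryConfig and its capture? method are illustrative only and are not govuk_app_config’s real API; the sketch shows appending to an exclusion list and a custom callback that is consulted only after the default exclusions pass.

```ruby
# Hypothetical stand-in for the configuration object; not govuk_app_config's real class.
class HypotheticalSentryConfig
  attr_reader :excluded_exceptions

  def initialize
    @excluded_exceptions = ["HypotheticalClientError"]
    @should_capture = ->(_exception) { true }
  end

  def should_capture=(callback)
    @should_capture = callback
  end

  # Default exclusions run first; the custom callback is only called
  # (lazily) if the exception is not on an excluded list.
  def capture?(exception)
    return false if excluded_exceptions.include?(exception.class.name)
    @should_capture.call(exception)
  end
end

class HypotheticalClientError < StandardError; end
class HypotheticalNoisyError < StandardError; end
class HypotheticalServerError < StandardError; end

CONFIG = HypotheticalSentryConfig.new
CONFIG.excluded_exceptions << "HypotheticalNoisyError" # append rather than overwrite
CONFIG.should_capture = ->(exception) { !exception.message.include?("ignore me") }
```

Here HypotheticalClientError is never captured, even though the custom callback would return true for it.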
When errors are received by Sentry
Sentry first fingerprints the error to decide what Issue to group it under. It then checks how many occurrences of that Issue have happened recently, and rate-limits (i.e. ignores) this occurrence if it is happening too frequently. More details below.
Errors are grouped into Issues automatically by Sentry. By default, Sentry groups based on stack trace. If no stack trace is available, it will group by exception type instead. Failing that, it will group by error message.
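The fallback order can be illustrated with a deliberately simplified fingerprint function. Sentry’s real grouping algorithm is far more involved; the HypotheticalEvent struct and the hashing scheme here are assumptions made only to show the stack trace, then exception type, then message precedence.

```ruby
require "digest"

# Simplified illustration only; not Sentry's actual grouping algorithm.
HypotheticalEvent = Struct.new(:frames, :exception_type, :message, keyword_init: true)

# Prefers the stack trace, falls back to exception type, then to the message.
def fingerprint(event)
  basis =
    if event.frames && !event.frames.empty?
      ["stacktrace", *event.frames]
    elsif event.exception_type
      ["type", event.exception_type]
    else
      ["message", event.message.to_s]
    end
  Digest::SHA256.hexdigest(basis.join("\n"))
end

a = HypotheticalEvent.new(frames: ["app/models/user.rb:10", "app/controllers/users_controller.rb:5"],
                          exception_type: "NoMethodError", message: "undefined method `name'")
b = HypotheticalEvent.new(frames: ["app/models/user.rb:10", "app/controllers/users_controller.rb:5"],
                          exception_type: "NoMethodError", message: "undefined method `email'")
puts fingerprint(a) == fingerprint(b) # prints true
```

Because a stack trace is available, the differing messages do not split these two events into separate issues.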
Fingerprint rules can be applied at the project level by editing the “Issue Grouping” page under a project (see example). For example, you can force all errors of the same exception type to have the same fingerprint. On the same screen, you can set stack trace rules, to remove or rename certain ‘frames’ in the trace. In practice, we don’t currently apply any custom rules, after an in-depth investigation found that Sentry’s default grouping was already very accurate (despite it often looking like the same exception is spread across multiple issues).
Stack trace cleaning
Note that sentry-raven automatically cleans up stack traces for the purposes of fingerprinting, so that stack trace frames like:
app/views/welcome/view_error.html.erb in _app_views_welcome_view_error_html_erb__2807287320172182514_65600 at line 1
…get normalised to:
app/views/welcome/view_error.html.erb at line 1
This means that if the same ActionView::Template::Error happens twice, it is grouped into the same issue instead of being erroneously treated as two separate issues.
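The normalisation can be sketched as a single regular-expression substitution that strips the compiled-template method name Rails generates. The pattern below is an illustrative assumption and is not sentry-raven’s actual regex.

```ruby
# Illustrative pattern only; sentry-raven's real pattern may differ.
# Matches a compiled-template method name such as
# " in _app_views_welcome_view_error_html_erb__2807287320172182514_65600".
COMPILED_TEMPLATE_METHOD = / in _\w+__-?\d+_\d+/

def normalise_frame(frame)
  frame.sub(COMPILED_TEMPLATE_METHOD, "")
end

FRAME = "app/views/welcome/view_error.html.erb in _app_views_welcome_view_error_html_erb__2807287320172182514_65600 at line 1"
puts normalise_frame(FRAME)
# prints: app/views/welcome/view_error.html.erb at line 1
```

The numeric suffix changes between processes and deploys, so removing it is what lets repeated template errors share a fingerprint.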
sentry-raven uses a customised version of Rails::BacktraceCleaner to do this. The original Rails::BacktraceCleaner would also normalise these frames, but it additionally removes any “framework trace” from the stack trace, leaving only the “application trace” behind. That would actually be beneficial for grouping, as Sentry is very sensitive to stack traces that differ only slightly (which can happen when a dependency is updated or the Ruby version is upgraded). However, removing the framework traces entirely means losing key diagnostic information, so the customised, less aggressive BacktraceCleaner is a better choice.
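The trade-off can be seen in a small sketch that compares a full trace with an application-only trace after a dependency upgrade. The frames are hypothetical, and treating any path containing "/gems/" as a framework frame is a simplifying assumption for this illustration.

```ruby
# Hypothetical frames; the "/gems/" check is an assumed way to spot framework frames.
FRAMES = [
  "app/controllers/welcome_controller.rb at line 12",
  "/usr/local/bundle/gems/actionpack-6.1.0/lib/action_controller/metal/basic_implicit_render.rb at line 6",
  "app/views/welcome/view_error.html.erb at line 1",
].freeze

# The same trace after a minor dependency upgrade changes the gem path.
UPGRADED_FRAMES = FRAMES.map { |frame| frame.sub("actionpack-6.1.0", "actionpack-6.1.1") }.freeze

def framework_frame?(frame)
  frame.include?("/gems/")
end

# Aggressive cleaning: keep only the application trace.
def application_trace_only(frames)
  frames.reject { |frame| framework_frame?(frame) }
end

puts FRAMES == UPGRADED_FRAMES # prints false
puts application_trace_only(FRAMES) == application_trace_only(UPGRADED_FRAMES) # prints true
```

Dropping framework frames would keep the two traces in one issue, but the actionpack frame, which may be exactly where the failure happened, would no longer be visible.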
GOV.UK is on a legacy billing plan with an account limit of 1000 events per hour. When this limit is reached, subsequent events get discarded, meaning that noisy issues can prevent other, more important issues from being logged.
In practice, it’s more complicated than that. We have a per-project limit of 50%, the lowest that Sentry allows. This means that no single project on Sentry can log more than 500 events per hour before it is rate limited. This is a protection mechanism: one project cannot exhaust the account limit on its own, so at least two projects would need to be recording high volumes of errors to risk breaching it. These limits are configured on the Rate Limits page.
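The interaction between the two limits can be checked with a simplified hourly counter. The 1000-event account limit and 50% project share come from the text above; the HypotheticalHourlyQuota class is a sketch and does not reflect how Sentry actually enforces quotas.

```ruby
# Figures from the text; the counter itself is a simplified sketch.
ACCOUNT_LIMIT_PER_HOUR = 1000
PROJECT_SHARE = 0.5
PROJECT_LIMIT_PER_HOUR = (ACCOUNT_LIMIT_PER_HOUR * PROJECT_SHARE).to_i # 500

class HypotheticalHourlyQuota
  def initialize
    @account_count = 0
    @project_counts = Hash.new(0)
  end

  # Returns true if the event is accepted, false if it is rate limited.
  def accept?(project)
    return false if @account_count >= ACCOUNT_LIMIT_PER_HOUR
    return false if @project_counts[project] >= PROJECT_LIMIT_PER_HOUR
    @account_count += 1
    @project_counts[project] += 1
    true
  end
end
```

One noisy project is capped at 500 accepted events, leaving room for others; only when two projects each hit 500 is the account limit reached and every project’s events are discarded.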
In addition, we’ve set up alerting so that any issue that records 100 or more errors within an hour triggers an alert in #govuk-platform-health. These alerts are configured in the Alerts panel.
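The alert itself is configured in Sentry, not in code. As a rough illustration of what "100 or more errors within an hour" means, the following sketch applies a sliding one-hour window to hypothetical event timestamps (in seconds); Sentry’s own evaluation of alert rules may differ.

```ruby
ALERT_THRESHOLD = 100
WINDOW_SECONDS = 3600

# Returns true if any one-hour window contains at least `threshold` events.
def should_alert?(timestamps, threshold: ALERT_THRESHOLD, window: WINDOW_SECONDS)
  sorted = timestamps.sort
  start = 0
  sorted.each_with_index do |t, i|
    # Drop events that are an hour or more older than the current one.
    start += 1 while t - sorted[start] >= window
    return true if i - start + 1 >= threshold
  end
  false
end
```

With this definition, 100 events spaced 30 seconds apart trigger an alert, whereas 100 events spaced 40 seconds apart do not, because no single hour contains all of them.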
Sentry issue actions
In the Sentry UI, you can merge related issues together, resolve issues, or ignore issues (permanently or for a set time period).
Merging Sentry issues
Duplicate issues can be merged by going into the project view, checking the boxes associated with each issue, and clicking “Merge”. This should be done very carefully, as issues cannot be unmerged once they are merged, and the issues are often subtly different and warrant being separate issues. For example, they may have the same exception type but occur in different transactions, so require two separate fixes.
To find out why Sentry has treated two issues as separate, visit each issue and scroll to the bottom of the page to see the “Event grouping information”.
Resolving an issue
When you know you’ve fixed the underlying issue, you should comment on the issue explaining what you’ve done, and then click the “Resolve” button. This removes it from the default Sentry UI, making it less noisy. It also means that if the issue occurs again, Sentry marks it as a regression and emails you.
Ignoring an issue
When you’ve identified an issue and written it up as a Trello card, or are actively working on fixing it, it can be unhelpful for the issue to keep accumulating events (and potentially spamming your Slack channel). In these cases, you should comment on the issue with a link to your card or PR, then ignore the issue. You should also set the “Assignee” to yourself.
You can either click “Ignore” to ignore it permanently, or click the arrow next to it to ignore for a set time, e.g. 1 week. You can also un-ignore an issue later.
Commenting on an issue
Click on the issue, then on the “Activity” tab, where you can leave a comment. Comments support markdown.
Deleting and discarding an issue
If you’ve identified an issue that is high-volume, but is unlikely to be fixed any time soon, you can Delete and Discard the issue by clicking the arrow next to the trash can and selecting “Delete and discard future events”.
This should only be used when the issue is likely to have a significant impact on our Sentry quota. It is possible to “undiscard” the issue later, but this will only capture new events. Any events prior to the “undiscard” action are lost.
Special Sentry accounts
There is a 2ndLineBot member on the members list, which is set up so that a weekly Sentry report is sent to the 2nd line email address. This bot account should not be deleted.
GDS-wide usage of Sentry
Sentry is used by several programmes in GDS, not just GOV.UK. A report, GDS use of Sentry.io, covers this in more detail, including documenting some of the limitations of the setup.