Sentry
Sentry is a service that collates errors that happen on GOV.UK in one place, regardless of which application, what environment or what machine the error occurred on. It is far more convenient than logging on to individual machines and querying their logs.
Useful links:
- Sentry Organization Stats for an overview of how many errors are being sent/rate-limited site-wide.
- Sentry Projects for a per-project view, enabling you to see which ones are causing the most errors right now.
- Sentry Grafana dashboards: Production, Staging and Integration (all require VPN), for a per-environment breakdown of Sentry errors.
Getting access to Sentry
Your tech lead should raise a PR to give you Sentry access in the govuk_tech.yml file in govuk-user-reviewer. Once the PR is merged and the Terraform has been applied, you’ll be able to sign in using your GDS Google account.
Nomenclature
- Project: each GOV.UK application has its own project on Sentry, where errors originating from that application are consolidated. Example: publishing-api
- Issue: a group of similar errors. For example, multiple instances of NoMethodError from the same app are likely to be logged under the same issue (see example) in the project in Sentry. In practice, there are a variety of factors that determine whether errors are grouped under the same issue.
- Error/Exception: a single occurrence of an Issue, containing details such as stack trace, arguments, server IP and error message. See example.
How GOV.UK projects are added to Sentry
Projects can be created and edited in the Sentry UI, but this risks creating inconsistencies or missing apps. We therefore configure projects using Terraform.
Teams are managed in govuk-user-reviewer, and projects are managed in govuk-infrastructure.
To create a new team or project, edit the relevant Terraform configuration, then run a plan and apply for the project in Terraform Cloud.
How Sentry is integrated on GOV.UK
Apps are configured to talk to Sentry using the govuk_app_config gem, which interfaces with Sentry via its GovukError class. Apps call GovukError.configure - see example. This uses the delegator pattern to proxy requests to the underlying Sentry gem, which is sentry-ruby in govuk_app_config v4 and above.
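As a rough illustration of the delegator pattern mentioned above, the sketch below wraps a configuration object and forwards unknown calls to it. It uses Ruby's standard SimpleDelegator; the names UnderlyingConfig, WrappedConfig and extra_excluded_exceptions are hypothetical and are not taken from govuk_app_config, whose real implementation may differ.

```ruby
require "delegate"

# Hypothetical stand-in for the configuration object of the underlying client.
UnderlyingConfig = Struct.new(:dsn, :environment)

# Wrapper that adds one extra setting and forwards everything else.
class WrappedConfig < SimpleDelegator
  attr_accessor :extra_excluded_exceptions

  def initialize(underlying)
    super
    @extra_excluded_exceptions = []
  end
end

underlying = UnderlyingConfig.new
config = WrappedConfig.new(underlying)

config.environment = "integration"                     # forwarded to `underlying`
config.extra_excluded_exceptions << "SomeIgnoredError" # handled by the wrapper

puts underlying.environment                   # prints "integration"
puts config.extra_excluded_exceptions.inspect # prints ["SomeIgnoredError"]
```

The benefit of this shape is that callers use one object for both the wrapper's own settings and the settings of the wrapped client.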
Unhandled exceptions are automatically logged to Sentry, but you can also manually report something to Sentry using GovukError.notify. This method takes an exception object, or a string.
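A minimal sketch of manual reporting might look like the following. To keep it runnable without the gem, it defines a stand-in GovukError module; the stub, its reported helper and the parse_id method are assumptions made for illustration only. In a real app the module comes from govuk_app_config and only the GovukError.notify calls would remain.

```ruby
# Stand-in so the sketch runs without the gem. In a real app, govuk_app_config
# defines GovukError, and this stub (including `reported`) would not exist.
module GovukError
  def self.reported
    @reported ||= []
  end

  def self.notify(exception_or_message)
    reported << exception_or_message
  end
end

# Hypothetical task: parse an ID, reporting bad input instead of crashing.
def parse_id(row)
  Integer(row.fetch("id"))
rescue KeyError, ArgumentError => e
  GovukError.notify(e) # report the exception object, then carry on
  nil
end

parse_id({ "id" => "42" })  # returns 42, nothing is reported
parse_id({ "id" => "abc" }) # returns nil, an ArgumentError is reported
GovukError.notify("Import finished with skipped rows") # a string works too
```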
Sentry roles
There are different ‘roles’ in Sentry, with different permissions:
- Most people with access to Sentry have the “Member” role, which lets them view errors, issues and projects.
- Those with the “Admin” role can configure teams and projects.
- Those with the “Manager” role can edit organisation settings and set things like global rate limits.
- Those with the “Owner” role have unrestricted access to the system and can make billing and plan changes.
Some links in this documentation will only be visible to those with the right permissions. If you need a higher permission than you already have, the senior tech team can apply this for you.
Deciding whether to log errors in Sentry
We should only capture actionable errors in Sentry, i.e. errors representing an underlying issue that developers can fix. We should not capture 4XX responses, for example, as these indicate a client issue. In fact, sentry-raven ignores several exception types by default - see below.
Ignoring exception types
govuk_app_config has a list of ignorable exceptions (excluded_exceptions). If your application raises an exception on this list, it will not be logged in Sentry. Note that sentry-raven also has a default list of exceptions to ignore, but these are overwritten, rather than appended to, by govuk_app_config, so have no meaning on GOV.UK.
There is also a list of exceptions to ignore if they occur during the nightly data sync: data_sync_excluded_exceptions. These are exceptions that are commonly raised while the databases and content store are periodically unavailable.
When determining whether or not to ignore an exception, the exception chain is inspected. This means if an exception isn’t on the ignore list, but its underlying cause is an exception on the ignore list, then it will be ignored. This applies to excluded_exceptions and to data_sync_excluded_exceptions.
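The chain inspection described above can be sketched with Ruby's built-in Exception#cause. This is an illustration rather than the gem's actual code: the EXCLUDED list, the method names, and the choice to match on exact class name are all assumptions.

```ruby
# Hypothetical ignore list, held as class names.
EXCLUDED = ["KeyError"].freeze

# Collect the exception followed by each underlying cause.
def exception_chain(exception)
  chain = []
  while exception
    chain << exception
    exception = exception.cause
  end
  chain
end

# True if the exception, or anything in its chain of causes, is on the list.
def ignored?(exception, excluded = EXCLUDED)
  exception_chain(exception).any? { |e| excluded.include?(e.class.name) }
end

# Build a RuntimeError whose cause is a KeyError.
wrapped = begin
  begin
    {}.fetch(:missing) # raises KeyError
  rescue KeyError
    raise "lookup failed" # Ruby sets `cause` to the KeyError being handled
  end
rescue RuntimeError => e
  e
end

puts ignored?(wrapped)                    # prints "true"
puts ignored?(RuntimeError.new("direct")) # prints "false"
```

Here the RuntimeError itself is not on the list, but it is still ignored because its cause is.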
Advanced Sentry customisation
You can customise any of the lists described above by appending to them.
You can also run arbitrary code to decide whether or not an error should be logged in Sentry, using the before_send lambda. Your custom code will be combined with the default evaluators such that if any of the callbacks returns nil, the exception will be excluded.
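One way to picture how several callbacks combine is the sketch below, where each callback receives the event and either returns it (possibly modified) or returns nil to drop it. Events are modelled as plain hashes here, and run_callbacks, drop_health_checks and tag_event are hypothetical names; the real gem works with Sentry event objects and may combine callbacks differently.

```ruby
# Each callback takes (event, hint) and returns the event to keep it,
# or nil to exclude it.
drop_health_checks = lambda do |event, _hint|
  event[:transaction] == "/healthcheck" ? nil : event
end

tag_event = lambda do |event, _hint|
  event.merge(tags: { team: "example" })
end

# Run the callbacks in order, stopping as soon as one returns nil.
def run_callbacks(callbacks, event, hint = {})
  callbacks.each do |callback|
    event = callback.call(event, hint)
    return nil if event.nil?
  end
  event
end

callbacks = [drop_health_checks, tag_event]

kept = run_callbacks(callbacks, { transaction: "/search" })
dropped = run_callbacks(callbacks, { transaction: "/healthcheck" })

puts kept.inspect    # the event, now with tags added
puts dropped.inspect # prints "nil"
```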
When errors are received by Sentry
Sentry first fingerprints the error to decide what Issue to group it under. It then checks how many occurrences of that Issue have happened recently, and rate-limits (i.e. ignores) this occurrence if it is happening too frequently. More details below.
Fingerprinting
Errors are grouped into Issues automatically by Sentry. By default, Sentry groups based on stack trace. If no stack trace is available, it will group by exception type instead. Failing that, it will group by error message.
Fingerprint rules can be applied at the project level by editing the “Issue Grouping” page under a project (see example). For example, you can force all errors of the same exception type to have the same fingerprint. On the same screen, you can set stack trace rules, to remove or rename certain ‘frames’ in the trace. In practice, we don’t currently apply any custom rules, after an in-depth investigation found that Sentry’s default grouping was already very accurate (despite it often looking like the same exception is spread across multiple issues).
Stack trace cleaning
Note that sentry-raven automatically cleans up stack traces for the purposes of fingerprinting, so that stack trace frames like:
app/views/welcome/view_error.html.erb in _app_views_welcome_view_error_html_erb__2807287320172182514_65600 at line 1
…get normalised to:
app/views/welcome/view_error.html.erb at line 1
This means if the same ActionView::Template::Error happens twice, it is grouped into the same issue instead of being erroneously treated as separate issues.
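The normalisation shown above can be approximated with a regular expression that strips the generated template method name. This is only a sketch of the idea, not the cleaner that the Sentry gem actually uses; normalise_frame and the pattern are assumptions.

```ruby
# Matches the generated method name that Rails adds for compiled templates,
# e.g. " in _app_views_welcome_view_error_html_erb__2807287320172182514_65600".
GENERATED_TEMPLATE_METHOD = / in _\w+?__\d+_\d+/

# Remove the generated method name so identical errors share one frame.
def normalise_frame(frame)
  frame.sub(GENERATED_TEMPLATE_METHOD, "")
end

raw = "app/views/welcome/view_error.html.erb in " \
      "_app_views_welcome_view_error_html_erb__2807287320172182514_65600 at line 1"

puts normalise_frame(raw)
# prints "app/views/welcome/view_error.html.erb at line 1"
```

Frames that do not contain a generated method name are left unchanged.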
sentry-raven uses a customised version of Rails::BacktraceCleaner to do this. The original BacktraceCleaner would do this, but also remove any “framework trace” from the stack trace, leaving only the “application trace” behind. The original BacktraceCleaner would actually be beneficial for grouping, as Sentry is very sensitive to stack traces that differ only slightly (which can happen when a dependency is updated or a Ruby version is upgraded). However, removing the framework traces entirely means losing key diagnostic information, so the customised, less aggressive BacktraceCleaner is a better choice.
Rate limiting
GOV.UK is on a legacy billing plan with an account limit of 1000 events per hour. When this limit is reached, subsequent events get discarded, meaning that noisy issues can prevent other, more important issues from being logged.
In practice, it’s more complicated than that. We have a per-project limit of 50% of the account limit, the lowest that Sentry allows. This means that any one project on Sentry cannot log more than 500 events in an hour before it is rate limited. This is a protection mechanism, ensuring that we would need two projects to be recording high-volume errors to risk breaching our account limit. These limits are configured on the Rate Limits page.
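The arithmetic behind those figures, using only the numbers quoted above (the constant and variable names are illustrative):

```ruby
ACCOUNT_LIMIT_PER_HOUR = 1000 # events per hour across the whole account
PER_PROJECT_PERCENT = 50      # per-project cap, as a percentage of the account limit

# Most events a single project can log in an hour before being rate limited.
per_project_limit = ACCOUNT_LIMIT_PER_HOUR * PER_PROJECT_PERCENT / 100

# Fewest projects that must all hit their cap to use up the account limit.
projects_to_exhaust = (ACCOUNT_LIMIT_PER_HOUR.to_f / per_project_limit).ceil

puts per_project_limit   # prints 500
puts projects_to_exhaust # prints 2
```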
Slack alerts
You can configure Sentry to notify a Slack channel when a condition is satisfied, such as when any issue records 100 or more errors in a one-hour period.
To set up an alert, visit the Alerts panel, select the project the alert should apply to (e.g. app-whitehall), and then click “Create Alert Rule”. It is currently not possible to set up a ‘global’ alert to apply to all projects at once.
We encourage teams to set up alerts for any projects they’re responsible for, so that they can be alerted to new and high-volume issues and prioritise them. Multiple teams are allowed to set up alerts for the same projects.
Sentry issue actions
In the Sentry UI, you can merge related issues together, resolve issues, or archive issues (permanently or for a set time period).
Merging Sentry issues
Duplicate issues can be merged by going into the project view, checking the boxes associated with each issue, and clicking “Merge”. This should be done very carefully, as issues cannot be unmerged once they are merged, and the issues are often subtly different and warrant being separate issues. For example, they may have the same exception type but occur in different transactions, so require two separate fixes.
To find out why Sentry has treated two issues as separate, visit each issue and scroll to the bottom of the page to see the “Event grouping information”.
Resolving an issue
When you know you’ve fixed the underlying issue, you should comment on the issue explaining what you’ve done, and then click the “Resolve” button. This removes the issue from the default Sentry view, making the view less noisy. It also means that, if the issue occurs again, Sentry marks it as a regression and emails you.
Archiving (or ignoring) an issue
You can “Archive” something for a set period of time - this option used to be called “Ignore” in Sentry. Archiving an issue for a set period of time can be useful if you’ve identified an issue and written it up as a Trello card (or are actively working on fixing it), as it prevents Sentry from accumulating events and potentially spamming your Slack channel.
In these cases, you should comment on the issue with a link to your card or PR, then archive the issue for a set period. You should also set the “Assignee” to yourself.
You can always un-archive an issue later if needed.
Commenting on an issue
Click on the issue, then on the “Activity” tab, where you can leave a comment. Comments support markdown.
Deleting and discarding an issue
If you’ve identified an issue that is high-volume, but is unlikely to be fixed any time soon, you can Delete and Discard the issue by clicking the arrow next to the trash can and selecting “Delete and discard future events”.
This should only be used when the issue is likely to have a significant impact on our Sentry quota. It is possible to “undiscard” the issue later, but this will only capture new events. Any events prior to the “undiscard” action are lost.
GDS-wide usage of Sentry
Sentry is used by several programmes in GDS, not just GOV.UK. A report, GDS use of Sentry.io, covers this in more detail, including documenting some of the limitations of the setup.