GOV.UK Site search alerts and monitoring
As described in GOV.UK Search: How it works, there are two search stacks on GOV.UK. This page covers monitoring and alerting for GOV.UK site search only.
Sentry
Sentry is set up to track application errors from Finder Frontend and Search API v2, including the sync process and API calls from frontend apps.
Kibana
Kibana can be used to query application logs as per all other GOV.UK apps.
Grafana
The GOV.UK Search Grafana dashboard visualizes core metrics for site search.
Alertmanager
We have an informal SLO to maintain a search and autocomplete success rate of about 99.99% over any 24-hour period. Four Alertmanager rules are currently configured in govuk-helm-charts to notify the #govuk-search-alerts channel when success rates drop below the thresholds listed below.
These alerts are based purely on the ratio of error responses to successful responses served by Search API v2. In other words, they do not alert on search result quality. Those alerts are coming soon!
Rules:
- SearchDegradedAcute: 5-minute rolling success rate for search requests has dropped below 99% for more than 10 minutes.
- SearchDegradedMid: 1-hour rolling success rate for search requests has dropped below 99.9% for more than 2 hours.
- SearchDegradedLong: 24-hour rolling success rate for search requests has dropped below 99.99% for more than 24 hours.
- AutocompleteDegradedAcute: 5-minute rolling success rate for autocomplete requests has dropped below 90% for more than 10 minutes.
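To make the thresholds above concrete, here is a minimal Python sketch of the success-rate comparison each rule describes. It is not the Alertmanager configuration from govuk-helm-charts: the `Rule` class, the function names, and the choice to treat zero traffic as healthy are assumptions made for illustration. It also ignores the "for more than ..." duration condition and only checks a single window of request counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Illustrative stand-in for one alert rule (hypothetical structure)."""
    name: str
    window: str       # rolling window the success rate is measured over
    threshold: float  # minimum acceptable success rate, as a fraction


# Thresholds copied from the rule descriptions on this page.
RULES = [
    Rule("SearchDegradedAcute", "5m", 0.99),
    Rule("SearchDegradedMid", "1h", 0.999),
    Rule("SearchDegradedLong", "24h", 0.9999),
    Rule("AutocompleteDegradedAcute", "5m", 0.90),
]


def success_rate(successes: int, errors: int) -> float:
    """Fraction of responses in the window that were successful."""
    total = successes + errors
    if total == 0:
        return 1.0  # assumption: a window with no traffic counts as healthy
    return successes / total


def is_degraded(rule: Rule, successes: int, errors: int) -> bool:
    """True if the window's success rate is below the rule's threshold."""
    return success_rate(successes, errors) < rule.threshold
```

For example, 985 successful and 15 failed search responses in a 5-minute window give a success rate of 98.5%, which is below the 99% threshold for SearchDegradedAcute, so `is_degraded(RULES[0], 985, 15)` returns `True`.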
Causes and steps to take in the event of a Degradation of service alert firing
We are aware of the following occasional errors. They should not be considered critical and do not need intervention unless they occur consistently for a large number of users and don’t go away by themselves within a few minutes.
- Google::Cloud::DeadlineExceededError: A timeout occurred on the Google API.
- Google::Cloud::InternalError: An internal error occurred on the Google API.
- AMQ::Protocol::EmptyResponseError: RabbitMQ sent an unexpected response, possibly due to restarting (the listener will restart by itself in most cases).
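When triaging a batch of errors, it can help to separate the known occasional errors listed above from anything unexpected. The following is a small Python sketch of that idea. The function name and the input format (a list of error class names, for example copied from Sentry issue titles) are assumptions for illustration and not part of any existing tooling.

```python
# Error classes this page lists as known, occasional, and usually harmless.
KNOWN_TRANSIENT_ERRORS = frozenset({
    "Google::Cloud::DeadlineExceededError",
    "Google::Cloud::InternalError",
    "AMQ::Protocol::EmptyResponseError",
})


def split_errors(error_classes):
    """Partition error class names into (known transient, unexpected).

    The order of names within each list follows the input order.
    """
    known, unexpected = [], []
    for name in error_classes:
        if name in KNOWN_TRANSIENT_ERRORS:
            known.append(name)
        else:
            unexpected.append(name)
    return known, unexpected
```

Anything that ends up in the second list is not covered by the guidance on this page and should be investigated in the usual way.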
If these errors persist and trigger the Degradation of service alerts, this indicates an issue with GCP or Google Vertex AI Search.
- Log in to the Google Cloud console and make sure that the project Search API v2 Production is selected.
- Under “APIs & Services” in the GCP Console, review the Discovery Engine API usage for traffic and error rates.
How to contact Google if there is a critical issue with GCP or Google Vertex AI Search
- To raise a support ticket, first log in to the GCP console.
- Navigate to the Support/Cases section and press the GET HELP button. Make sure you provide comprehensive reproduction steps.
- For catastrophic issues, or if regular support is unresponsive, escalate the problem in the Google Chat space. Instructions are linked from the #govuk-search team Slack channel. Always include the support case number.