GOV.UK Site search alerts and monitoring
As described in GOV.UK Search: How it works there are two search stacks on GOV.UK. This page includes information on the monitoring and alerting for GOV.UK site search only.
Monitoring and alerting tools
Sentry
Sentry is set up to track application errors from Finder Frontend, Search API v2 and Search API v2 beta features including the sync process and API calls from frontend apps.
All environments are configured for Slack notifications. Production alerts are routed to #govuk-search-alerts, while alerts for other environments are sent to #govuk-search-alerts-nonprod.
Kibana
Kibana can be used to query application logs as per all other GOV.UK apps.
Grafana
The GOV.UK Search Grafana dashboard visualizes core metrics for site search.
Alertmanager
Alertmanager is used to monitor metrics in Prometheus, and send notifications when metrics cross predefined thresholds.
For site search, we use Alertmanager to notify us of “degradation of service” and “degradation of quality” alerts.
All environments are configured for Slack notifications. Production alerts are routed to #govuk-search-alerts, while alerts for other environments are sent to #govuk-search-alerts-nonprod.
Alert types and how to handle them
Degradation of service alerts
We have an informal SLO to maintain a search and autocomplete success rate of about 99.99% over any 24 hour period. There are currently four Alertmanager rules configured in govuk-helm-charts to send notifications on Slack, if rates drop below this:
- SearchDegradedAcute: 5 minute rolling success rate for search requests has dropped below 99% for more than 10 minutes.
- SearchDegradedMid: 1 hour rolling success rate for search requests has dropped below 99.9% for more than 2 hours.
- SearchDegradedLong: 24 hour rolling success rate for search requests has dropped below 99.99% for more than 24 hours.
- AutocompleteDegradedAcute: 5 minute rolling success rate for autocomplete requests has dropped below 90% for more than 10 minutes.
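The rules above all follow the same pattern: a rolling success rate, a threshold, and a period the rate must stay below the threshold before the alert fires. The following is a minimal Python sketch of that logic, for illustration only. The real rules are Alertmanager/Prometheus configuration in govuk-helm-charts; the `DegradationRule` class and the `success_rate` and `should_fire` functions are hypothetical names, and only the thresholds and durations are taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradationRule:
    name: str
    window: str        # rolling window over which the success rate is measured
    threshold: float   # minimum acceptable success rate, as a fraction
    hold_minutes: int  # how long the rate must stay below threshold before firing


# Thresholds and durations copied from the rule list; the structure is illustrative.
RULES = [
    DegradationRule("SearchDegradedAcute", "5m", 0.99, 10),
    DegradationRule("SearchDegradedMid", "1h", 0.999, 2 * 60),
    DegradationRule("SearchDegradedLong", "24h", 0.9999, 24 * 60),
    DegradationRule("AutocompleteDegradedAcute", "5m", 0.90, 10),
]


def success_rate(successful: int, total: int) -> float:
    """Fraction of successful requests; no traffic is treated as fully successful."""
    return 1.0 if total == 0 else successful / total


def should_fire(rule: DegradationRule, rate: float, minutes_below: int) -> bool:
    """True when the rate is below the threshold and has been for more than the hold period."""
    return rate < rule.threshold and minutes_below > rule.hold_minutes
```

For example, `should_fire(RULES[0], success_rate(985, 1000), 11)` is true, because a 98.5% success rate is below the 99% threshold and has lasted more than 10 minutes. The same rate sustained for only 5 minutes would not fire.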
Causes and steps to take in the event of a degradation of service alert firing
We are aware of the following occasional errors which should not be considered critical and do not need intervention unless they occur consistently for a large number of users and don’t go away by themselves within a few minutes.
- Google::Cloud::DeadlineExceededError: A timeout occurred on the Google API.
- Google::Cloud::InternalError: An internal error occurred on the Google API.
- AMQ::Protocol::EmptyResponseError: RabbitMQ sent an unexpected response, possibly due to restarting (the listener will restart by itself in most cases).
If these errors persist and trigger the degradation of service alerts, this indicates an issue with GCP or Google Vertex AI Search.
- Log in to the Google Cloud console, and make sure that the project Search API v2 Production is selected.
- Under “APIs & Services” in the GCP Console, review the Discovery Engine API usage for traffic and error rates.
- Issues with Vertex should be raised with Google.
Degradation of search quality alerts
We have additional Alertmanager rules related to search result quality configured in govuk-helm-charts to send notifications on Slack, if search quality drops below given thresholds:
- SearchQualityDegradedBinaryRecall: Top 3 recall for the binary sample query set has dropped below 90% (“warning” level), and below 80% (“critical” level).
- SearchQualityDegradedClickstreamNDCG: Top 10 NDCG for the clickstream sample query set has dropped below 85% (“warning” level), and below 75% (“critical” level).
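Each quality rule has two thresholds, so a single metric value maps to no alert, a “warning” or a “critical”. A minimal Python sketch of that mapping follows, assuming scores are expressed as fractions between 0 and 1. `QUALITY_THRESHOLDS` and `quality_severity` are hypothetical names used for illustration; the real rules live in govuk-helm-charts.

```python
from typing import Optional

# (warning, critical) thresholds as fractions, taken from the rule list.
QUALITY_THRESHOLDS = {
    "SearchQualityDegradedBinaryRecall": (0.90, 0.80),
    "SearchQualityDegradedClickstreamNDCG": (0.85, 0.75),
}


def quality_severity(alert: str, score: float) -> Optional[str]:
    """Return "critical", "warning", or None for a given alert and metric score."""
    warning, critical = QUALITY_THRESHOLDS[alert]
    if score < critical:
        return "critical"
    if score < warning:
        return "warning"
    return None
```

Under these thresholds, a top 3 binary recall of 0.85 would be a “warning”, and 0.79 would be “critical”.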
Steps to take in the event of a degradation of search quality alert firing
Significant drops in search quality need to be investigated by the search team to diagnose the issue and raise a support ticket with Google, if appropriate. If you notice a drop in search quality, make sure the Performance Analyst, Product Manager and Delivery Manager on the search team are aware.
Failures running evaluation-related rake tasks
Evaluations are run regularly on a schedule, to measure and monitor the quality of search results and notify the search team if quality drops via “degradation of quality” alerts. Parts of this workflow are executed via rake tasks in Search API v2 beta features, and raise errors in Sentry when they fail. The most common errors are:
- Google::Cloud::AlreadyExistsError (Active evaluation already exists): Only one evaluation in each environment is allowed to be run at one time. This error occurs when an evaluation is triggered when there is already one running.
- Google::Cloud::NotFoundError (SampleQuerySet): Occurs when a sample query set has not been created, but a rake task is attempting to use it.
- Google::Cloud::NotFoundError (ServingConfig): Occurs when an evaluation cannot locate the specified serving configuration.
Because the evaluations feature is in beta, rake tasks can also fail for unexpected reasons.
Steps to take in the event of a sample query set not being created
If a rake task is attempting to use a sample query set that does not exist and a related Google::Cloud::NotFoundError is being raised, the relevant sample query set should be created manually by re-running the cronJob that creates the sample query sets in Argo. This rake task will run through to create all sample query sets (binary, clickstream and explicit) for the current month, and import the data from BigQuery. Where the sample query set has been created already, the creation of the sample query set will be skipped and the data will be reimported (replaced) into the sample query set.
It is important to ensure that all required sample query sets exist in all environments, since evaluations cannot run without them.
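The create-or-skip behaviour described above means the task is safe to re-run: existing sample query sets are kept and only their data is replaced. The following Python sketch models that behaviour in memory, purely for illustration. It does not call Google Cloud or the real rake task, and the function name and the `<type>_<month>` identifier scheme are assumptions, not the identifiers Search API v2 uses.

```python
SET_TYPES = ("binary", "clickstream", "explicit")


def sync_sample_query_sets(existing: set[str], month: str) -> dict[str, str]:
    """Model one run of the task: return the action taken for each sample query set."""
    actions = {}
    for set_type in SET_TYPES:
        set_id = f"{set_type}_{month}"  # hypothetical naming scheme
        if set_id in existing:
            # Set already exists: creation is skipped, data is reimported (replaced).
            actions[set_id] = "skipped creation, reimported data"
        else:
            existing.add(set_id)
            actions[set_id] = "created, imported data"
    return actions
```

Running the function twice for the same month creates any missing sets on the first run and skips creation for all three on the second.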
Steps to take in the event of a failed evaluation run
If an evaluation fails for a reason other than a sample query set not existing, take the following steps:
1. Check the environment. If the failure is in production, carry on through the next steps. Otherwise, do nothing. (Quality of search results is less meaningful/important in non-production environments.)
2. Check which evaluations have failed by checking logs in Kibana. To get an effective read on current quality of search results, we need to be regularly reporting our important metrics for “this month”. In practice this can mean:
   - Ignoring any failed runs of "last month" evaluations.
   - Ignoring any failed runs of the explicit evaluations.
   - Ensuring there is a successful run of the "this month" clickstream evaluation at least every other day.
   - Ensuring there is a successful run of the "this month" binary evaluation at least twice a day.
   If an evaluation needs to be re-run, carry on through the next steps.
3. (Optional) Temporarily pause scheduled evaluations. Because only one evaluation can run at a time in each environment, if a scheduled evaluation starts before an ad-hoc run has finished, the scheduled evaluation is likely to fail and raise more alerts. If you want to avoid this, you can edit the cron schedule in helm charts to temporarily pause scheduled evaluations.
4. Re-run the relevant evaluation cronJob in Argo for the evaluation that has failed. Note that the evaluations rake task will run the evaluation for both “this month” and “last month”.
5. Restore usual schedule of evaluation runs. This is only relevant if scheduled evaluations were temporarily paused in step 3.
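The triage in the first two steps can be summarised as a small decision function. This is a rough Python sketch rather than a tool that exists in the codebase. `FailedRun`, `MAX_GAP_HOURS` and `needs_rerun` are hypothetical names, and reading “every other day” as 48 hours and “twice a day” as 12 hours is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedRun:
    environment: str  # for example "production" or "staging"
    kind: str         # "binary", "clickstream" or "explicit"
    period: str       # "this_month" or "last_month"


# Assumed maximum gap between successful "this month" runs, in hours.
MAX_GAP_HOURS = {"clickstream": 48, "binary": 12}


def needs_rerun(run: FailedRun, hours_since_last_success: float) -> bool:
    """Decide whether a failed evaluation run is worth re-running manually."""
    if run.environment != "production":
        return False  # non-production quality is less important
    if run.period == "last_month" or run.kind == "explicit":
        return False  # these failures can be ignored
    return hours_since_last_success > MAX_GAP_HOURS[run.kind]
```

For instance, a failed production "this month" binary run with the last success 13 hours ago would need a re-run, while the same failure in a non-production environment would not.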
How to contact Google if there is a critical issue with GCP or Google Vertex AI Search
- To raise a support ticket, first log in to the GCP console.
- Navigate to the Support/Cases section and press the GET HELP button. Make sure you provide comprehensive reproduction steps.
- For catastrophic issues, or if regular support is unresponsive, escalate the problem in the Google Chat space. Instructions are linked from the #govuk-search team Slack channel. Always include the support case number.