GOV.UK Site search alerts and monitoring

Last updated: 12 Jan 2026

As described in GOV.UK Search: How it works, there are two search stacks on GOV.UK. This page covers monitoring and alerting for GOV.UK site search only.

Monitoring and alerting tools

Sentry

Sentry is set up to track application errors from Finder Frontend, Search API v2 and Search API v2 beta features, including the sync process and API calls from frontend apps.

All environments are configured for Slack notifications. Production alerts are routed to #govuk-search-alerts, while alerts for other environments are sent to #govuk-search-alerts-nonprod.

Kibana

Kibana can be used to query application logs, as with all other GOV.UK apps.

Grafana

The GOV.UK Search Grafana dashboard visualizes core metrics for site search.

Alertmanager

Alertmanager is used to monitor metrics in Prometheus and send notifications when those metrics cross predefined thresholds.

All environments are configured for Slack notifications. Production alerts are routed to #govuk-search-alerts, while alerts for other environments are sent to #govuk-search-alerts-nonprod.

Alert types and how to handle them

Degradation of service alerts

We monitor and alert on both Search API v2 success rates and Google Vertex response times.

Search API v2 success rates

We have an informal SLO to maintain a search and autocomplete success rate of about 99.99% over any 24-hour period. Four Alertmanager rules are currently configured in govuk-helm-charts to send Slack notifications when success rates drop below the following thresholds:

SearchDegradedAcute: the 5-minute rolling success rate for search requests has dropped below 99% for more than 10 minutes.
SearchDegradedMid: the 1-hour rolling success rate for search requests has dropped below 99.9% for more than 2 hours.
SearchDegradedLong: the 24-hour rolling success rate for search requests has dropped below 99.99% for more than 24 hours.
AutocompleteDegradedAcute: the 5-minute rolling success rate for autocomplete requests has dropped below 90% for more than 10 minutes.
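
To give a feel for what these thresholds mean in terms of failed requests, here is a minimal Python sketch. The rule names and percentages come from the list above; the function names and the request volumes in the usage note are made up for illustration and are not real GOV.UK traffic figures. The sketch only checks a single window against a threshold and does not model the "for more than N minutes" part of each rule.

```python
from fractions import Fraction

# Thresholds copied from the rule descriptions above (percentage of successful requests).
RULE_THRESHOLDS = {
    "SearchDegradedAcute": Fraction("99"),
    "SearchDegradedMid": Fraction("99.9"),
    "SearchDegradedLong": Fraction("99.99"),
    "AutocompleteDegradedAcute": Fraction("90"),
}


def success_rate(successes: int, total: int) -> Fraction:
    """Success rate as a percentage; an empty window is treated as 100%."""
    if total == 0:
        return Fraction(100)
    return Fraction(successes * 100, total)


def is_below_threshold(rule: str, successes: int, total: int) -> bool:
    """True if the window's success rate is strictly below the rule's threshold."""
    return success_rate(successes, total) < RULE_THRESHOLDS[rule]


def max_failures(rule: str, total: int) -> int:
    """Largest number of failed requests in a window that still meets the threshold."""
    allowed_fraction = (100 - RULE_THRESHOLDS[rule]) / 100
    return int(allowed_fraction * total)  # rounds down
```

For example, with a hypothetical 1,000,000 search requests in a 24-hour window, max_failures("SearchDegradedLong", 1_000_000) is 100: one more failure than that puts the window below 99.99%.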

Google Vertex AI Search request durations

There is currently one Alertmanager rule configured in govuk-helm-charts, HighVertexP90Latency, which sends notifications in Slack if the duration of requests to Google Vertex’s Search endpoint exceeds the configured threshold. As the name suggests, the rule tracks 90th percentile (P90) latency.
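
For readers less familiar with percentile-based latency alerts, the following Python sketch shows a P90 check using the nearest-rank method. This page does not state the threshold or how the percentile is calculated in Prometheus, so the function names, the threshold_seconds parameter and the sample durations in the usage note are all assumptions for illustration.

```python
def percentile_nearest_rank(values: list[float], percentile: int) -> float:
    """Nearest-rank percentile: the smallest value with at least
    `percentile` percent of samples at or below it."""
    if not values:
        raise ValueError("at least one sample is required")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in the range 1..100")
    ordered = sorted(values)
    # Integer ceiling of percentile/100 * n, avoiding floating point rounding.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p90_exceeds(durations_seconds: list[float], threshold_seconds: float) -> bool:
    """True if the P90 request duration is above the (hypothetical) threshold."""
    return percentile_nearest_rank(durations_seconds, 90) > threshold_seconds
```

With ten sample durations where the two slowest are 1.8 and 2.5 seconds, the P90 is 1.8 seconds, so a hypothetical threshold of 1.0 seconds would be exceeded even though most requests are fast. This is why a percentile is a more useful signal than an average for spotting a slow tail.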

Causes and steps to take in the event of a degradation of service alert firing

We are aware of the following occasional errors. They are not critical and do not need intervention unless they occur consistently for a large number of users and do not go away by themselves within a few minutes.

Google::Cloud::DeadlineExceededError: A timeout occurred on the Google API.
Google::Cloud::InternalError: An internal error occurred on the Google API.
AMQ::Protocol::EmptyResponseError: RabbitMQ sent an unexpected response, possibly due to restarting (the listener will restart by itself in most cases).

If these errors persist and trigger the degradation of service alerts, this indicates an issue with GCP or Google Vertex AI Search.

Log in to the Google Cloud console and make sure that the project Search API v2 Production is selected.
Under “APIs & Services” in the GCP Console, review the Discovery Engine API usage for traffic and error rates.
Issues with Vertex should be raised with Google.

Degradation of search quality alerts

We have additional Alertmanager rules for search result quality, configured in govuk-helm-charts to send Slack notifications when search quality drops below given thresholds:

SearchQualityDegradedBinaryRecall: Top 3 recall for the binary sample query set has dropped below 90% (“warning” level) or below 80% (“critical” level).
SearchQualityDegradedClickstreamNDCG: Top 10 NDCG for the clickstream sample query set has dropped below 85% (“warning” level) or below 75% (“critical” level).
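
As background on the two metrics, the Python sketch below uses the common textbook definitions of recall at k and NDCG at k (linear gain, log2 discount) and maps a score to the alert levels listed above. It is an illustration only: this page does not describe how Vertex AI Search computes or aggregates these metrics, so the formulas, function names and inputs here are assumptions. Only the rule names and percentage thresholds are taken from the list above.

```python
import math
from typing import Optional

# (warning, critical) thresholds as fractions, from the rules above.
QUALITY_THRESHOLDS = {
    "SearchQualityDegradedBinaryRecall": (0.90, 0.80),
    "SearchQualityDegradedClickstreamNDCG": (0.85, 0.75),
}


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("at least one relevant document is required")
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(gain / math.log2(index + 2) for index, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: list[float], judged_gains: list[float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking
    built from every judged gain for the query."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_gains, k) / ideal


def severity(rule: str, score: float) -> Optional[str]:
    """Return "critical", "warning" or None for a score between 0 and 1."""
    warning, critical = QUALITY_THRESHOLDS[rule]
    if score < critical:
        return "critical"
    if score < warning:
        return "warning"
    return None
```

For instance, if a query has two relevant pages and only one of them appears in the top 3 results, recall_at_k returns 0.5 for that query. A score of 0.85 on the binary recall rule would map to "warning", and 0.79 to "critical".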

Steps to take in the event of a degradation of search quality alert firing

Significant drops in search quality need to be investigated by the search team to diagnose the issue and raise a support ticket with Google, if appropriate. If you notice a drop in search quality, make sure the Performance Analyst, Product Manager and Delivery Manager on the search team are aware.

Evaluations run on a regular schedule to measure and monitor the quality of search results, and notify the search team via “degradation of quality” alerts if quality drops. Parts of this workflow are executed via rake tasks in Search API v2 beta features, and raise errors in Sentry when they fail. The most common errors are:

Google::Cloud::AlreadyExistsError (Active evaluation already exists): Only one evaluation can run at a time in each environment. This error occurs when an evaluation is triggered while another is still running.
Google::Cloud::NotFoundError (SampleQuerySet): Occurs when a rake task attempts to use a sample query set that has not been created.
Google::Cloud::NotFoundError (ServingConfig): Occurs when an evaluation cannot locate the specified serving configuration.

Because the evaluations feature is in beta, rake tasks can also fail for unexpected reasons.

Steps to take in the event of a sample query set not being created

If a rake task attempts to use a sample query set that does not exist and raises a related Google::Cloud::NotFoundError, create the missing sample query set by manually re-running, in Argo, the cronJob that creates the sample query sets. The underlying rake task creates all sample query sets (binary, clickstream and explicit) for the current month and imports their data from BigQuery. Where a sample query set already exists, its creation is skipped and its data is reimported (replaced).
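
The practical consequence of this behaviour is that re-running the job is safe even when some sets already exist. The Python sketch below models that with an in-memory dictionary. It is not the real rake task: the function name, the set naming scheme and the load_queries callback (standing in for the BigQuery import) are all hypothetical.

```python
from typing import Callable

SET_TYPES = ("binary", "clickstream", "explicit")


def sync_sample_query_sets(
    store: dict[str, list[str]],
    month: str,
    load_queries: Callable[[str, str], list[str]],
) -> list[str]:
    """Create any missing sets for `month`, then replace every set's data.

    Returns the ids of the sets that had to be created.
    """
    created = []
    for set_type in SET_TYPES:
        set_id = f"{set_type}_{month}"  # hypothetical naming scheme
        if set_id not in store:
            store[set_id] = []  # creation is skipped when the set already exists
            created.append(set_id)
        # Data is always reimported, replacing whatever was there before.
        store[set_id] = list(load_queries(set_type, month))
    return created
```

Running the function a second time with the same inputs creates nothing new and leaves each set holding the freshly imported data.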

It is important to ensure that all required sample query sets exist in all environments, since evaluations cannot run without them.

Steps to take in the event of a failed evaluation run

If an evaluation fails for a reason other than a sample query set not existing, take the following steps:

1. Check the environment. If the failure is in production, carry on through the next steps. Otherwise, do nothing. (The quality of search results is less meaningful and less important in non-production environments.)
2. Check which evaluations have failed by checking the logs in Kibana. To get an effective read on the current quality of search results, we need to report our important metrics for “this month” regularly. In practice this can mean:

- Ignoring any failed runs of "last month" evaluations.
- Ignoring any failed runs of the explicit evaluations.
- Ensuring there is a successful run of the "this month" clickstream evaluation at least every other day.
- Ensuring there is a successful run of the "this month" binary evaluation at least twice a day.

If an evaluation needs to be re-run, carry on through the next steps.

3. (Optional) Temporarily pause scheduled evaluations. Because only one evaluation can run at a time in each environment, a scheduled evaluation that starts before an ad-hoc run has finished is likely to fail and raise more alerts. To avoid this, you can edit the cron schedule in the Helm charts to temporarily pause scheduled evaluations.
4. Re-run the relevant evaluation cronJob in Argo for the evaluation that has failed. Note that the evaluations rake task will run the evaluation for both “this month” and “last month”.
5. Restore the usual schedule of evaluation runs. This is only relevant if scheduled evaluations were temporarily paused in step 3.
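
The triage rules in the first two steps can be summarised as a small decision function. This Python sketch is an illustration rather than an existing tool: the function and parameter names are hypothetical, and it interprets "at least every other day" as a maximum age of 2 days and "at least twice a day" as a maximum age of 12 hours, which is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed maximum age of the last successful "this month" run, per evaluation type.
# Explicit evaluations are deliberately absent: their failures are ignored.
MAX_AGE = {
    "clickstream": timedelta(days=2),
    "binary": timedelta(hours=12),
}


def needs_rerun(
    environment: str,
    evaluation_type: str,
    period: str,
    last_success: Optional[datetime],
    now: datetime,
) -> bool:
    """Decide whether a failed evaluation should be re-run manually."""
    if environment != "production":
        return False  # non-production failures need no action
    if period != "this_month":
        return False  # failed "last month" runs are ignored
    max_age = MAX_AGE.get(evaluation_type)
    if max_age is None:
        return False  # explicit evaluations are ignored
    if last_success is None:
        return True  # no successful run on record
    return now - last_success > max_age
```

For example, a failed production binary evaluation whose last successful "this month" run was 13 hours ago would be flagged for a re-run, while the same failure in a non-production environment would not.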

How to contact Google if there is a critical issue with GCP or Google Vertex AI Search

To raise a support ticket, first log in to the GCP console.
Navigate to the Support/Cases section and press the GET HELP button, making sure to provide comprehensive reproduction steps.
For catastrophic issues, or if regular support is unresponsive, escalate the problem in the Google Chat space; instructions are linked from the #govuk-search team Slack channel. Always include the support case number.