Last updated: 19 May 2026

GOV.UK Site search alerts and monitoring

As described in GOV.UK Search: How it works, there are two search stacks on GOV.UK. This page covers monitoring and alerting for GOV.UK site search only.

Monitoring and alerting tools

Sentry

Sentry is set up to track application errors from Finder Frontend, Search API v2 and Search API v2 beta features, including the sync process and API calls from frontend apps.

All environments are configured for Slack notifications. Production alerts are routed to #govuk-search-alerts, while alerts for other environments are sent to #govuk-search-alerts-nonprod.

Kibana

Kibana can be used to query application logs as per all other GOV.UK apps.

Grafana

The GOV.UK Search Grafana dashboard visualises core metrics for site search.

Alertmanager

Alertmanager is used to monitor metrics in Prometheus, and send notifications when metrics cross predefined thresholds.

All environments are configured for Slack notifications. Production alerts are routed to #govuk-search-alerts, while alerts for other environments are sent to #govuk-search-alerts-nonprod.

Alert types and how to handle them

Degradation of service alerts

We monitor and alert on both Search API v2 success rates and Google Discovery Engine response times.

Search API v2 success rates

We have an informal SLO to maintain a search and autocomplete success rate of about 99.99% over any 24-hour period. There are currently four Alertmanager rules configured in govuk-helm-charts that send Slack notifications if rates drop below this:

Search alerts
  • SearchDegradedAcute (Critical): the 5-minute rolling success rate for search requests has dropped below 99% for more than 10 minutes.

  • SearchDegradedMid (Critical): the 1-hour rolling success rate for search requests has dropped below 99.9% for more than 2 hours.

  • SearchDegradedLong (Warning): the 24-hour rolling success rate for search requests has dropped below 99.99% for more than 24 hours.

Autocomplete alerts
  • AutocompleteDegradedAcute (Critical): the 5-minute rolling success rate for autocomplete requests has dropped below 90% for more than 10 minutes.

  • Note on future work: The Search team is developing improved autocomplete monitoring. Currently, the success rate remains 100% if the Discovery Engine API responds with a 200 status code but returns no autocomplete suggestions for queries where suggestions should be available.
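
To make the thresholds above concrete, here is a minimal Python sketch of the arithmetic behind them. It is not how the rules are implemented (they are Prometheus expressions in govuk-helm-charts); the THRESHOLDS dictionary and the function names are hypothetical, and only the percentages come from this page. The sketch checks the rate for a single window and does not model the "for more than" duration of each rule.

```python
import math
from fractions import Fraction

# Thresholds copied from the alert descriptions above. The dictionary itself
# is illustrative and is not read from govuk-helm-charts.
THRESHOLDS = {
    "SearchDegradedAcute": Fraction("0.99"),
    "SearchDegradedMid": Fraction("0.999"),
    "SearchDegradedLong": Fraction("0.9999"),
    "AutocompleteDegradedAcute": Fraction("0.90"),
}


def success_rate(successes: int, total: int) -> Fraction:
    """Return the exact success rate; an empty window counts as fully successful."""
    if total == 0:
        return Fraction(1)
    return Fraction(successes, total)


def below_threshold(rule: str, successes: int, total: int) -> bool:
    """True when the rate for one window is under the rule's threshold.

    The real rules also require the condition to hold for a period of time
    (for example 10 minutes); that part is not modelled here.
    """
    return success_rate(successes, total) < THRESHOLDS[rule]


def allowed_failures(rule: str, total: int) -> int:
    """Largest number of failed requests that keeps the rate at or above the threshold."""
    return math.floor(total * (1 - THRESHOLDS[rule]))
```

For example, allowed_failures("SearchDegradedLong", 1_000_000) returns 100: a 99.99% target leaves room for 100 failed requests in every million. Fractions are used instead of floats so that results sitting exactly on a threshold are not affected by rounding.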

Google Cloud Discovery Engine request durations

There is currently one Alertmanager rule configured in govuk-helm-charts, HighVertexP90Latency, which sends notifications in Slack if requests to the Google Cloud Discovery Engine search endpoint exceed acceptable duration thresholds.

Causes of degradation of service alerts firing

A common cause of drops in search success rate is high latency from the Discovery Engine API. This causes the Google::Cloud::DiscoveryEngine Ruby client to time out and raise Google::Cloud::DeadlineExceededError, which in turn leads Search API v2 to respond with 500 errors for search result requests.

We are also aware of the following occasional errors, which should not be considered critical and do not need intervention unless they occur consistently for a large number of users and trigger critical alerts:

  • Google::Cloud::InternalError: An internal error occurred on the Google API.
  • AMQ::Protocol::EmptyResponseError: RabbitMQ sent an unexpected response, possibly due to restarting (the listener will restart by itself in most cases).

Steps to take in the event of a critical alert being triggered

If a high number of Google::Cloud::DeadlineExceededError or Google::Cloud::InternalError errors persist and trigger a critical degradation of service alert, this indicates an issue with GCP or Discovery Engine that we should raise with Google.

  1. Log in to the Google Cloud console and make sure that the project Search API v2 Production is selected.
  2. Under “APIs & Services” in the GCP Console, review the Discovery Engine API usage for traffic and error rates.
  3. Raise any issues with Discovery Engine with Google (see the section on how to contact Google at the end of this page).
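
The guidance above can be summarised as a small decision function. This is a sketch for illustration only: the error class names come from this page, but the triage function, the action labels and the "investigate" fallback for cases this page does not cover are hypothetical.

```python
# Error class names are taken from the text above. The action labels and the
# "investigate" fallback are hypothetical, for illustration only.
GOOGLE_ERRORS = {
    "Google::Cloud::DeadlineExceededError",
    "Google::Cloud::InternalError",
}
OCCASIONAL_ERRORS = {
    "Google::Cloud::InternalError",
    "AMQ::Protocol::EmptyResponseError",
}


def triage(error_class: str, critical_alert_firing: bool) -> str:
    """Suggest a next step for an error class seen in Sentry."""
    if critical_alert_firing and error_class in GOOGLE_ERRORS:
        # Persistent Google errors behind a critical alert point to GCP
        # or Discovery Engine.
        return "raise_with_google"
    if not critical_alert_firing and error_class in OCCASIONAL_ERRORS:
        # Known occasional errors need no intervention on their own.
        return "no_action"
    # Anything else is outside the guidance above (assumption of this sketch).
    return "investigate"
```

For example, triage("AMQ::Protocol::EmptyResponseError", False) returns "no_action", while triage("Google::Cloud::DeadlineExceededError", True) returns "raise_with_google".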

Quantifying the user impact of low search success rates

A degradation in Search API v2 success rates indicates that users are experiencing a problem with site search on GOV.UK. However, it does not quantify how many users are affected. A better measure is the error rate in Finder Frontend or Router for requests to search/all with a keyword.

Suggested Kibana query:

logjson.request_uri: *search/all?keywords=*
kubernetes.labels.app_kubernetes_io/name: finder-frontend
logjson.status: 500 to 503
kubernetes.container.name: nginx

The most accurate way to measure the number of error pages shown to users is to use Athena to query the Fastly CDN logs.

Suggested Athena query:

SELECT status, COUNT(*) AS count
FROM fastly_logs.govuk_www
WHERE date = 8 AND month = 5 AND year = 2026
  AND request_received >= TIMESTAMP '2026-05-08 16:00:00.000'
  AND request_received < TIMESTAMP '2026-05-08 23:00:00.000'
  AND url LIKE '%/search/all.html?keywords=%'
GROUP BY status
ORDER BY count DESC
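
Once the query above returns a count per status code, the share of requests that ended in an error page is a short calculation. The Python sketch below shows one way to do it. The error_share function is hypothetical, and the example counts are made up for illustration; they are not real GOV.UK figures.

```python
def error_share(status_counts: dict[int, int]) -> float:
    """Return the fraction of requests that ended in a 5xx status."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(
        count for status, count in status_counts.items() if 500 <= status <= 599
    )
    return errors / total


# Made-up counts for illustration; these are not real GOV.UK figures.
example = {200: 9_900, 500: 25, 503: 75}
print(f"{error_share(example):.2%}")  # prints 1.00%
```

The keys are the status values and the values are the count column from the query result.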

Degradation of search quality alerts

We have additional Alertmanager rules related to search result quality configured in govuk-helm-charts. These send notifications on Slack if search quality drops below given thresholds.

Steps to take in the event of a degradation of search quality alert firing

Significant drops in search quality need to be investigated by the search team to diagnose the issue and raise a support ticket with Google, if appropriate. If you notice a drop in search quality, make sure the Performance Analyst, Product Manager and Delivery Manager on the search team are aware.

Evaluations run on a regular schedule to measure and monitor the quality of search results, and notify the search team through “degradation of quality” alerts if quality drops. Parts of this workflow are executed via rake tasks in Search API v2 beta features, which raise errors in Sentry when they fail. The most common errors are:

  • Google::Cloud::AlreadyExistsError (Active evaluation already exists): Only one evaluation in each environment is allowed to be run at one time. This error occurs when an evaluation is triggered when there is already one running.
  • Google::Cloud::NotFoundError (SampleQuerySet): Occurs when a sample query set has not been created, but a rake task is attempting to use it.
  • Google::Cloud::NotFoundError (ServingConfig): Occurs when an evaluation cannot locate the specified serving configuration.

Because the evaluations feature is in beta, rake tasks can also fail for unexpected reasons.

Steps to take in the event of a sample query set not being created

If a rake task attempts to use a sample query set that does not exist and raises a related Google::Cloud::NotFoundError, create the missing sample query set by re-running the cronJob in Argo that creates the sample query sets. The rake task creates all sample query sets (binary, clickstream and explicit) for the current month and imports their data from BigQuery. Where a sample query set already exists, creation is skipped and its data is re-imported (replaced).

It is important to ensure that all required sample query sets exist in all environments, since evaluations cannot run without them.

Steps to take in the event of a failed evaluation run

If an evaluation fails for a reason other than a sample query set not existing, take the following steps:

  1. Check the environment. If the failure is in production, carry on through the next steps. Otherwise, do nothing. (Quality of search results is less meaningful/important in non-production environments.)

  2. Check which evaluations have failed by checking logs in Kibana. To get an effective read on current quality of search results, we need to be regularly reporting our important metrics for “this month”. In practice this can mean:

     • Ignoring any failed runs of “last month” evaluations.
     • Ignoring any failed runs of the explicit evaluations.
     • Ensuring there is a successful run of the “this month” clickstream evaluation at least every other day.
     • Ensuring there is a successful run of the “this month” binary evaluation at least twice a day.

If an evaluation needs to be re-run, carry on through the next steps.
  3. (Optional) Temporarily pause scheduled evaluations. Because only one evaluation can run at a time in each environment, a scheduled evaluation that starts before an ad-hoc run has finished is likely to fail and raise more alerts. To avoid this, you can edit the cron schedule in helm charts to temporarily pause scheduled evaluations.

  4. Re-run the relevant evaluation cronJob in Argo for the evaluation that has failed. Note that the evaluations rake task will run the evaluation for both “this month” and “last month”.

  5. Restore the usual schedule of evaluation runs. This is only relevant if scheduled evaluations were temporarily paused in step 3.
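
The checks in steps 1 and 2 can be expressed as a short decision function. The Python sketch below is illustrative only: the needs_rerun function, its argument values and the MAX_GAP_HOURS table are hypothetical, and the 48-hour and 12-hour limits are this sketch's reading of "every other day" and "twice a day".

```python
# Maximum acceptable gap, in hours, between successful "this month" runs.
# 48 and 12 are this sketch's reading of "every other day" and "twice a day".
MAX_GAP_HOURS = {"clickstream": 48, "binary": 12}


def needs_rerun(
    environment: str,
    kind: str,
    period: str,
    hours_since_last_success: float,
) -> bool:
    """Decide whether a failed evaluation run is worth re-running."""
    if environment != "production":
        return False  # search quality matters less outside production
    if period != "this_month":
        return False  # failed "last month" runs can be ignored
    if kind not in MAX_GAP_HOURS:
        return False  # explicit (or unrecognised) evaluations can be ignored
    return hours_since_last_success >= MAX_GAP_HOURS[kind]
```

For example, needs_rerun("production", "binary", "this_month", 13) returns True, because more than half a day has passed without a successful binary run, while the same failure in a non-production environment returns False.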

How to contact Google if there is a critical issue with GCP or Discovery Engine

  1. To raise a support ticket, first log in to the GCP console.
  2. Navigate to the Support/Cases section and press the GET HELP button. Make sure you provide comprehensive reproduction steps.
  3. For catastrophic issues, or if regular support is unresponsive, escalate the problem in the Google Chat space. Instructions are linked from the #govuk-search team Slack channel. Always include the support case number.