What to do if someone says search is down
As described in GOV.UK Search: how it works, there are two search stacks on GOV.UK. This documentation provides debugging steps to take when someone tells you “search is down”.
This information is currently summarised as a flow chart.
Identify what exactly is broken
Searches from search/all with a query param are sent to SearchAPI v2 and Google Vertex. Without a query param, requests are sent to search-api (v1), which talks to Elasticsearch. So a quick way to identify which search stack is implicated is to see whether searching with and without a query param results in different behaviour.
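The comparison above can be scripted. This is a minimal sketch, not an official tool. It assumes the query param is named `keywords` and that a status of 500 or higher (or no response at all) counts as a failure. Neither assumption comes from this document, and the function names are hypothetical.

```python
from typing import Callable
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://www.gov.uk/search/all"  # page named in the text above


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for url, or 0 if no response was received."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code  # the server answered, but with an error status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, etc.


def implicated_stack(get_status: Callable[[str], int] = http_status) -> str:
    """Compare search/all with and without a query param."""
    # "keywords" is an assumed parameter name; "passport" is an arbitrary term.
    with_query = get_status(f"{BASE_URL}?{urlencode({'keywords': 'passport'})}")
    without_query = get_status(BASE_URL)

    query_failed = with_query == 0 or with_query >= 500
    no_query_failed = without_query == 0 or without_query >= 500

    if query_failed and no_query_failed:
        return "both"
    if query_failed:
        return "site-search"  # SearchAPI v2 / Google Vertex path
    if no_query_failed:
        return "search-api-v1"  # search-api (v1) / Elasticsearch path
    return "neither"
```

Calling `implicated_stack()` with no arguments makes two live requests. A result of `"both"` may point to something broader than either search product.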
Check the error rates for site search
The GOV.UK Search Grafana dashboard visualises metrics for site search. The Search error rate is usually at or close to 0.00%. If the error rate here is high, this suggests site search is implicated. Skip to If site search is unavailable.
Check error rates for search v1
Check the application dashboard for SearchAPI (v1) to see if it’s serving a high number of errors.
Are all dynamic pages failing to load?
Pages that display search results - from either the new or old search stack - are dynamic and therefore cannot be mirrored. An incident involving the failure of a rendering application or a general frontend misconfiguration is often spotted first on search/all, even though it is not a failure of a search product. To identify if this is the case, you could check:
- Complain about your council, which uses site search and is rendered by Frontend.
- Guidance and Regulation finder, which uses old search and is rendered by Finder frontend.
If both of those pages fail to load, it is highly unlikely that the cause is a failure of both search products and more likely a broader issue.
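The reasoning above can be written as a small decision table. This is a sketch only: the function name and return strings are hypothetical, and the single-failure outcomes are an inference from which application renders each page, not a rule stated in this document.

```python
def triage_dynamic_pages(site_search_page_ok: bool, old_search_page_ok: bool) -> str:
    """Interpret whether the two example pages loaded.

    site_search_page_ok: the page that uses site search (rendered by Frontend).
    old_search_page_ok: the page that uses old search (rendered by Finder frontend).
    """
    if not site_search_page_ok and not old_search_page_ok:
        # Both search products failing at once is highly unlikely.
        return "broader issue"
    if not site_search_page_ok:
        return "site search or Frontend"
    if not old_search_page_ok:
        return "old search or Finder frontend"
    return "no failure seen"
```

For example, `triage_dynamic_pages(False, False)` returns `"broader issue"`, which suggests looking beyond the search products.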
If site search is unavailable
If the problem appears to be with site search, is the problem with SearchAPI v2 or with Google Vertex, which provides our search results?
Things to check:
1. Sentry alerts in #govuk-search-alerts
Check for unexpected Sentry errors that might help identify issues. In addition to catching general exceptions, Sentry is also used to catch synchronisation errors (when new content is pushed to the VAIS datastore).
A high number of synchronisation errors would suggest that new content is failing to reach our datastore, rather than failing to be returned as search results.
2. Application logs in Kibana
SearchAPI v2 raises a DiscoveryEngine::InternalError in the event of an error from Vertex. We do not surface those errors in Sentry, but they can be found in Kibana. A quick, if hacky, way to find them is to search for the Rails logger message “Did not get search results”. A high number of these messages suggests a problem with Vertex, which should be raised with Google support.
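If you have exported matching log entries from Kibana, a short script can show whether the message is occurring at a high rate. This is a sketch under stated assumptions: it expects one JSON object per line with `message` and `@timestamp` fields. That export format and those field names are assumptions, not something this document specifies.

```python
import json
from collections import Counter
from typing import Iterable

MARKER = "Did not get search results"  # Rails logger message quoted above


def failures_per_minute(lines: Iterable[str]) -> Counter:
    """Count log entries containing MARKER, bucketed by minute."""
    counts: Counter = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects
        if MARKER in str(entry.get("message", "")):
            # Assumed ISO 8601 timestamp: keep "YYYY-MM-DDTHH:MM" as the bucket.
            counts[str(entry.get("@timestamp", ""))[:16]] += 1
    return counts
```

For example, passing an open file handle (`failures_per_minute(open("export.jsonl"))`, where the file name is hypothetical) returns a `Counter` keyed by minute, which makes a sudden spike easy to spot.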
If site search is returning bad results
Search result relevance is fine-tuned via a combination of:
- Boosts and demotions applied at the serving configuration level
- Boosts for recency applied at the application level at query time
- Constant training of the model on user event (GA4) data sent to Vertex from our BigQuery account
Recent changes to these files, or a failure of the user event data import, would be candidate causes of a reduction in search result quality, but they would be unlikely to have a catastrophically bad impact.