Last updated: 18 Feb 2026

search-api-v2-beta-features: Evaluations

We use Discovery Engine's in-built evaluations feature to measure the quality and relevance of search results. This allows us to:

  • Monitor search quality with automated alerts if relevance scores drop unexpectedly.

  • Compare versions of the Discovery Engine search engine so that we can test out new configurations to improve search results.

Evaluations rely on 'judgement lists' (known as 'sample query sets' in Discovery Engine terminology). These are sets of search queries paired with targets and ratings, which indicate how relevant specific results are for those queries. We use three types of judgement list (clickstream, binary and explicit) to create evaluations. When an evaluation is run, point-in-time search results are compared to the sample query sets to produce scores, which can be tracked over time.

A note on terminology

The terms "judgement list" and "sample query set" refer to very similar but slightly different concepts and so can easily be confused. For clarity:

  • A judgement list is both a concept and a dataset that exists in BigQuery. It refers to the curated list of search queries paired with their expected target documents and a relevance score (0–3).

  • A sample query set is a Discovery Engine resource. It is the specific technical format required by Discovery Engine. The data from a judgement list is imported into Discovery Engine as a "sample query set" to actually execute the evaluation.

Judgement lists

Implicit judgement lists

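The implicit judgement lists are generated from the results users click on when using site search. They are therefore defined "implicitly" by user behaviour, in contrast to the explicit judgement lists, which are defined "explicitly" by hand. There are two types of implicit judgement list: clickstream and binary.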

Clickstream

The clickstream judgement lists are generated from the results users click on when using site search, with each result given a score of 0–3 depending on how many clicks it gets. Each query can have multiple targets; the number varies depending on how many results received enough clicks to pass the threshold. For example:

{
  "queryEntry": {
    "query": "share code",
    "targets": [{
      "uri": "ebf341bc-aa8f-4105-a042-b7ba1b7a110e",
      "score": "3"
    }, {
      "uri": "ce7d39a8-4ae3-4621-b38d-153767270013",
      "score": "2"
    }, {
      "uri": "6f3be85d-4ac0-4f9a-8e78-13ef23081842",
      "score": "2"
    }, {
      "uri": "4b0906d6-8fe8-48c0-ac72-b1ee98cb5e01",
      "score": "2"
    }, {
      "uri": "1f01f4f5-d9c1-416c-8dc8-c459451c7451",
      "score": "1"
    }, {
      "uri": "88e4e053-8345-4015-b747-5e74941553ec",
      "score": "1"
    }, {
      "uri": "abb4ed81-45df-43d6-8f40-045012d23776",
      "score": "1"
    }, {
      "uri": "435fb04f-2b9f-4f44-8b41-2a962e8c46a8",
      "score": "1"
    }, {
      "uri": "63fe19b5-bbcd-4712-9250-a1cd3125c470",
      "score": "1"
    }, {
      "uri": "585fc781-dc17-4a1d-a66c-9515f9970ab5",
      "score": "1"
    }, {
      "uri": "4816fa06-9e6c-4414-96b6-aa7ec7f24968",
      "score": "1"
    }, {
      "uri": "bee455d5-5a4f-440a-88be-eb65ae8fde7d",
      "score": "1"
    }, {
      "uri": "acadabdc-c1ae-42e8-a666-1728f91f4d21",
      "score": "1"
    }, {
      "uri": "2ba2fbbf-c8fd-4e5f-b010-3fe0c1515566",
      "score": "1"
    }]
  }
}

This type of judgement list is used for assessing relevance of search results at scale.
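
As a minimal sketch of the nested shape shown above, the following groups flat rows into one entry per query. The row format, the `ROWS` constant and the `clickstream_entries` helper are hypothetical and only for illustration; the real lists are built in SQL via Dataform.

```ruby
require "json"

# Hypothetical flat rows (one per query-target pair), for illustration only.
ROWS = [
  { query: "share code", uri: "ebf341bc-aa8f-4105-a042-b7ba1b7a110e", score: 3 },
  { query: "share code", uri: "1f01f4f5-d9c1-416c-8dc8-c459451c7451", score: 1 },
  { query: "share code", uri: "ce7d39a8-4ae3-4621-b38d-153767270013", score: 2 },
].freeze

# Groups rows by query and nests all of a query's targets under a single
# queryEntry, ordered from highest to lowest score.
def clickstream_entries(rows)
  rows.group_by { |row| row[:query] }.map do |query, query_rows|
    targets = query_rows
      .sort_by { |row| [-row[:score], row[:uri]] }
      .map { |row| { "uri" => row[:uri], "score" => row[:score].to_s } }
    { "queryEntry" => { "query" => query, "targets" => targets } }
  end
end

puts JSON.pretty_generate(clickstream_entries(ROWS).first)
```

Scores are written as strings here only to mirror the example above.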

Binary

The binary judgement lists contain search queries with results that have a score of 3, indicating that they’re a perfect match for the query. Each list is made up of individual query-target pairs. A query can appear more than once if it has multiple targets with a score of 3, but each target is split out into its own query-target pair rather than being nested under one query, as it would be in a clickstream judgement list. For example:

{
  "queryEntry": {
    "query": "check share code",
    "targets": [{
      "uri": "4b0906d6-8fe8-48c0-ac72-b1ee98cb5e01",
      "score": "3"
    }]
  }
}
{
  "queryEntry": {
    "query": "check share code",
    "targets": [{
      "uri": "ce7d39a8-4ae3-4621-b38d-153767270013",
      "score": "3"
    }]
  }
}

This type of judgement list is used to check whether any results that should appear are missing, which helps us identify relevance issues quickly.
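
To illustrate the difference from the clickstream shape, here is a small sketch that splits one nested entry into separate query-target pairs, keeping only targets with a score of 3. The `NESTED` data and the `binary_pairs` helper are hypothetical; this is not how the lists are actually generated.

```ruby
# Hypothetical nested entry: one query with two perfect matches and one
# weaker match (the weaker match uses a made-up URI).
NESTED = {
  "queryEntry" => {
    "query" => "check share code",
    "targets" => [
      { "uri" => "4b0906d6-8fe8-48c0-ac72-b1ee98cb5e01", "score" => "3" },
      { "uri" => "ce7d39a8-4ae3-4621-b38d-153767270013", "score" => "3" },
      { "uri" => "00000000-0000-0000-0000-000000000000", "score" => "1" },
    ],
  },
}.freeze

# Returns one queryEntry per perfect-match target, so the query is repeated
# rather than the targets being nested together.
def binary_pairs(entry)
  query = entry["queryEntry"]["query"]
  entry["queryEntry"]["targets"]
    .select { |target| target["score"] == "3" }
    .map { |target| { "queryEntry" => { "query" => query, "targets" => [target] } } }
end

binary_pairs(NESTED).each { |pair| puts pair.inspect }
```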

Explicit

The explicit judgement lists are manually created and curated. They are based on what the search team has determined should be the top result for a query.

How evaluations are run

End-to-end

This diagram shows the end-to-end process of how an evaluation is run, including how the data is prepared before the evaluation and how results are processed afterwards. Please note that it only covers implicit evaluations for now.

sequenceDiagram
    participant GOV.UK
    participant GA4
    participant Dataform
    participant Big Query
    participant Discovery Engine
    participant Search API V2
    participant GCP Bucket
    participant Prometheus
    activate GOV.UK
    activate GA4
    Note left of GOV.UK: User interaction data on GOV.UK is collected<br/>using Google Analytics.
    GOV.UK->>GA4: User interaction data
    deactivate GA4
    deactivate GOV.UK
    GOV.UK-->Prometheus: 
    Dataform->>+GA4: Data request
    activate Dataform
    Note left of GOV.UK: Dataform runs SQL queries at 1am on the 1st<br/>of each month to create implicit "judgement<br/>lists" ('binary' and 'clickstream'), from 3<br/>months of user interaction data, labelled with<br/>a "partition time" of the start of the month<br/>the query is run during.
    GA4->>-Dataform: Data from 'partitioned<br/>flattened events table'
    Note left of GOV.UK: Example: on 1st November the judgement list<br/>includes data from August, September and<br/>October.
    activate Big Query
    Dataform->>Big Query: Judgement lists
    deactivate Dataform
    deactivate Big Query
    GOV.UK-->Prometheus: 
    Search API V2->>+Discovery Engine: Request to create (empty)<br/>sample query set
    activate Search API V2
    Note left of GOV.UK: A rake task is run at 2am on the first day of<br/>each month to create and import queries into a<br/>single "sample query set", labelled with a<br/>"partition date" of the start of the month the<br/>query is run during, matching the relevant<br/>judgement list.
    Search API V2->>Discovery Engine: Request to import queries<br/>into sample query set<br/>(from BigQuery)
    Note over Discovery Engine,Search API V2: Request body includes<br/>"partition date", which is<br/>used to identify the<br/>queries to import
    Discovery Engine->>+Big Query: Data request
    Note left of GOV.UK: Example: on 1st November we create a sample<br/>query set with a partition date of 1st November<br/>which contains user data from August,<br/>September and October.
    deactivate Search API V2
    Big Query->>-Discovery Engine: Judgement lists
    deactivate Discovery Engine
    GOV.UK-->Prometheus: 
    Search API V2->>+Discovery Engine: Request to create an<br/>evaluation for a specified<br/>sample query set
    activate Search API V2
    Note left of GOV.UK: Evaluations are run on a regular basis via a<br/>scheduled rake task (once per day for<br/>'clickstream', multiple times per (working) day<br/>for 'binary'), which evaluate point-in-time<br/>search results against two sample query sets:<br/>one labelled with a partition date of the start of<br/>the current month, and the other labelled with<br/>a partition date of the start of last month.
    Discovery Engine->>-Discovery Engine: Discovery Engine search<br/>results from queries in<br/>sample query set are<br/>checked against sample<br/>query set targets
    Note left of GOV.UK: Example: every day in November, we run<br/>evaluations against two sample query sets:<br/>one with a partition date of 1st November<br/>(which contains user data from August,<br/>September and October) and the other with a<br/>partition date of 1st October (which contains<br/>user data from July, August and September).
    Search API V2->>+Discovery Engine: Request to get evaluation<br/>results (with wait)
    Discovery Engine->>-Search API V2: Evaluation metrics
    Search API V2->>+GCP Bucket: Detailed metrics
    deactivate GCP Bucket
    Search API V2->>+Prometheus: Quality metrics
    deactivate Prometheus
    deactivate Search API V2

The overall process involves:

  1. User interaction data is gathered on GOV.UK. This is done using GA4.
  2. Judgement lists are compiled and stored in BigQuery. The binary and clickstream judgement lists are compiled from user interaction data in SQL via Dataform. The explicit judgement lists are compiled manually.
  3. Judgement lists are taken from BigQuery and imported into Discovery Engine as sample query sets. This is done via the setup_sample_query_sets rake task.
  4. Evaluations are run on a regular basis, and results are stored in a GCP Bucket (detailed metrics) and Prometheus (high level metrics). This is done via the report_quality_metrics rake task.

Step 4 is usually what we mean when we say 'run an evaluation'.
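
The date logic described in the diagram can be sketched as below. This is an illustration only: the `judgement_list_window` helper is hypothetical, the year is arbitrary, and it assumes the "3 months of user interaction data" are the three whole calendar months before the partition date, as in the November example.

```ruby
require "date"

# Returns the partition date (start of the month the job runs in) and the
# range of user interaction data an implicit judgement list would cover,
# assuming three whole calendar months before the partition date.
def judgement_list_window(run_date)
  partition_date = Date.new(run_date.year, run_date.month, 1)
  {
    partition_date: partition_date,
    data_from: partition_date << 3, # first day of the month three months back
    data_to: partition_date - 1,    # last day of the previous month
  }
end

window = judgement_list_window(Date.new(2025, 11, 1))
puts window[:partition_date] # 2025-11-01
puts window[:data_from]      # 2025-08-01
puts window[:data_to]        # 2025-10-31
```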

Schedule

When first created, the crontasks for running evaluations had the following schedule:

  • Every day at 7am GMT (which is 8am BST) we run evaluations of the clickstream, binary and explicit sample query sets.

  • Every weekday at 10am, 12pm, 2pm, 4pm GMT (which is 11am, 1pm, 3pm and 5pm BST) we run evaluations of the binary sample query sets.

A gap of two hours was added between each type of evaluation run to stop them from clashing. Only one evaluation can be run at a time, and each crontask runs two evaluations: one for this month and one for last month. Each evaluation takes approximately 20-25 minutes to run on average.
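
As a rough check of the reasoning behind the two-hour gap, the sketch below works through the arithmetic for the weekday binary runs only. The constant and method names are hypothetical, and it assumes the two evaluations in a crontask run one after the other at the upper end of the 20-25 minute estimate.

```ruby
# Start hours of the weekday binary runs in the original schedule (GMT).
BINARY_START_HOURS = [10, 12, 14, 16].freeze

EVALUATIONS_PER_CRONTASK = 2           # this month and last month
WORST_CASE_MINUTES_PER_EVALUATION = 25 # upper end of the 20-25 minute estimate

# Total time one crontask needs if its evaluations run sequentially.
def crontask_minutes
  EVALUATIONS_PER_CRONTASK * WORST_CASE_MINUTES_PER_EVALUATION
end

# Smallest gap, in minutes, between consecutive start times.
def smallest_gap_minutes(start_hours)
  start_hours.sort.each_cons(2).map { |earlier, later| (later - earlier) * 60 }.min
end

headroom = smallest_gap_minutes(BINARY_START_HOURS) - crontask_minutes
puts "Worst-case run: #{crontask_minutes} min, headroom: #{headroom} min"
```

Under these assumptions a run needs about 50 minutes, which leaves roughly 70 minutes of headroom within each two-hour slot.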

See govuk-helm-charts for the current schedule.

How evaluations are scored

The evaluation process calculates three primary metrics. These are derived by comparing the actual search results at the time of the evaluation run against the sample query sets. The metrics are calculated at specific "top-k" cutoff levels (top 1, top 3, top 5 and top 10) to assess performance at different positions in the search results.

  • Recall (docRecall): The fraction of relevant targets in the top-k retrieved out of all relevant targets.

  • Precision (docPrecision): The fraction of retrieved targets in the top-k that are relevant.

  • NDCG (docNDCG): The Normalised Discounted Cumulative Gain at k. This measures the ranking quality of the search results.
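
The sketch below implements the textbook definitions of these three metrics for a single query. It is not Discovery Engine's implementation: the method names are hypothetical, a target is treated as relevant when its score is above 0, and using the judgement score directly as the NDCG gain is an assumption.

```ruby
# retrieved: ranked array of result identifiers (best first)
# relevance: hash of target identifier => relevance score

def relevant?(id, relevance)
  relevance.fetch(id, 0).positive?
end

# Fraction of all relevant targets that appear in the top k results.
def recall_at(k, retrieved, relevance)
  total = relevance.count { |_id, score| score.positive? }
  return 0.0 if total.zero?

  retrieved.first(k).count { |id| relevant?(id, relevance) }.to_f / total
end

# Fraction of the top k results that are relevant.
def precision_at(k, retrieved, relevance)
  top = retrieved.first(k)
  return 0.0 if top.empty?

  top.count { |id| relevant?(id, relevance) }.to_f / top.size
end

# Discounted cumulative gain: each gain is divided by log2(position + 1).
def dcg(gains)
  gains.each_with_index.map { |gain, index| gain / Math.log2(index + 2) }.sum
end

# DCG of the actual ranking divided by the DCG of the ideal ranking.
def ndcg_at(k, retrieved, relevance)
  gains = retrieved.first(k).map { |id| relevance.fetch(id, 0).to_f }
  ideal = relevance.values.map(&:to_f).sort.reverse.first(k)
  ideal_dcg = dcg(ideal)
  ideal_dcg.zero? ? 0.0 : dcg(gains) / ideal_dcg
end

retrieved = %w[D3 D1 D2]
relevance = { "D1" => 1, "D2" => 1 }

puts recall_at(1, retrieved, relevance)        # 0.0
puts recall_at(3, retrieved, relevance)        # 1.0
puts ndcg_at(3, retrieved, relevance).round(3) # 0.693
```

In the usage example the two relevant documents are ranked second and third, so top-3 recall is perfect while NDCG is penalised for the ordering.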

See the GCP Ruby Client docs for more details.

The full evaluation results that contain metrics for each individual query are known as "detailed metrics". At the end of the evaluation run, the detailed metrics are uploaded to a GCP Bucket. These can also be accessed via BigQuery.

For each evaluation, aggregate metrics are also calculated by averaging the query level results. These are known as "quality metrics". At the end of the evaluation run, quality metrics are pushed to Prometheus.
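
A minimal sketch of that averaging step is shown below. The per-query rows, the metric names and the `quality_metrics` helper are hypothetical and do not reflect the real schema of the detailed metrics.

```ruby
# Hypothetical per-query ("detailed") results, for illustration only.
DETAILED_METRICS = [
  { query: "share code", recall_top_3: 1.0, ndcg_top_10: 0.75 },
  { query: "passport",   recall_top_3: 0.0, ndcg_top_10: 0.25 },
].freeze

# Averages each named metric across all queries to give one aggregate
# ("quality") value per metric.
def quality_metrics(detailed, metric_names)
  return {} if detailed.empty?

  metric_names.to_h do |name|
    values = detailed.map { |row| row.fetch(name) }
    [name, values.sum / values.size]
  end
end

puts quality_metrics(DETAILED_METRICS, %i[recall_top_3 ndcg_top_10]).inspect
```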

Important metrics

For the binary evaluations, we pay particular attention to top-3 Recall. This is an easily interpretable metric that allows us to see whether a particular search result appears in the top 3 results for a given query.

For the clickstream evaluations, we pay particular attention to top-10 NDCG. While this can be more challenging to interpret, it allows us to monitor the ranking quality of the first page of search results.

This month and last month

As part of our evaluations methodology, we create new sample query sets at the start of each month and keep them static for the duration of that month. This gives us an apples-to-apples comparison of our evaluation results within a given month.

However, at the start of a new month we often see a jump in our evaluation metrics, because we are comparing search results against different sample query sets. To help us tell whether this jump indicates an underlying change in results quality or is just caused by the change in sample query sets, we run evaluations for both this month and last month, which gives us continuity across month changes.

For example, an evaluation run on 30 November against the 1 November ("this month") sample query set can be compared more easily to an evaluation run on 1 December against the 1 November (now "last month") sample query set than to an evaluation run on 1 December against the 1 December ("this month") sample query set.
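
The choice of sample query sets for a given run can be sketched as follows. The `partition_dates` helper is hypothetical and the year is arbitrary; it only illustrates how the "this month" set becomes the "last month" set when the month changes.

```ruby
require "date"

# Returns the partition dates of the two sample query sets an evaluation
# run on `run_date` would use.
def partition_dates(run_date)
  this_month = Date.new(run_date.year, run_date.month, 1)
  { this_month: this_month, last_month: this_month << 1 }
end

end_of_november = partition_dates(Date.new(2025, 11, 30))
start_of_december = partition_dates(Date.new(2025, 12, 1))

# The 1 November set is "this month" on 30 November and "last month" on
# 1 December, which is what gives continuity across the month change.
puts end_of_november[:this_month] == start_of_december[:last_month] # true
```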

How evaluations are monitored

Evaluations are monitored using Sentry, Kibana, Grafana and Prometheus/Alertmanager. For more information, see the GOV.UK Site search alerts and monitoring manual.