
search-api: Search Quality Metrics

We assess the quality of Search API search results with two metrics: Click Through Rate (an online metric) and Normalised Discounted Cumulative Gain (nDCG).

Online metrics

We monitor changes to search relevance over time in a Search Relevancy Dashboard. We call the metrics displayed there online metrics.

Our main online metric is Click Through Rate on the top results (top 1, 3, and 5). This isn't a great metric, since users might click on something that isn't what they were looking for, but it serves our needs in the absence of a more sophisticated way of measuring user success following a search.
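
As an illustration, Click Through Rate at a cutoff k could be computed along these lines; the data shape and field names are assumptions made for the example, not search-api's actual analytics schema:

# Hypothetical CTR@k calculation, for illustration only.
# Each session records the 1-based positions of any results the user clicked.
sessions = [
  { query: "harry potter", clicked_positions: [1] },
  { query: "harry potter", clicked_positions: [] },
  { query: "passport",     clicked_positions: [4] },
]

# CTR@k: the share of sessions with at least one click in the top k results.
def ctr_at(sessions, k)
  clicked = sessions.count { |s| s[:clicked_positions].any? { |pos| pos <= k } }
  clicked.to_f / sessions.size
end

[1, 3, 5].each { |k| puts "CTR@#{k}: #{ctr_at(sessions, k).round(2)}" }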

We also measure nDCG before and after re-ranking over time, to tell us how search is performing against relevance judgements.

Offline metrics

Our main offline metric is nDCG. We measure this before and after re-ranking by our learning to rank model.

We use Elasticsearch's Ranking Evaluation API to assess the quality of results retrieved from Elasticsearch prior to re-ranking. The API enables us to score how well search-api ranks results for a given query.
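
For reference, a Ranking Evaluation API request body (POSTed to the index's _rank_eval endpoint) looks roughly like the sketch below. The index name, document IDs and query are placeholders for illustration, not search-api's real configuration:

# Sketch of an Elasticsearch _rank_eval request body (placeholder values only).
rank_eval_body = {
  requests: [
    {
      id: "harry potter",
      # The search request whose ranking we want to evaluate.
      request: { query: { match: { title: "harry potter" } } },
      # Our relevancy judgements for this query (ratings from 0 to 3).
      ratings: [
        { _index: "govuk", _id: "harry-potter", rating: 3 },
        { _index: "govuk", _id: "harry-potter-world", rating: 2 },
        { _index: "govuk", _id: "who-is-harry-potter", rating: 2 }
      ]
    }
  ],
  # Score the top 10 results with normalised DCG.
  metric: { dcg: { k: 10, normalize: true } }
}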

What is rank evaluation?

Rank evaluation is a process by which we rate how well search-api ranks results for a given query.

For example, given the query Harry Potter, in an ideal situation the results would be returned in this order:

  1. Harry Potter
  2. Harry Potter World
  3. Who is Harry Potter?

However, the results might come back in this order:

  1. Who is Harry Potter?
  2. Sign in to your HMRC account
  3. Harry Potter

The first situation is good and the second is bad. But what we really want is a metric to say how bad the results are.

Rank evaluation provides us with a metric which tells us how good or bad the results are for a set of queries that we have already manually ranked.

So we have a rake task debug:rank_evaluation that does this.

This is useful because we can tell how good a change is before running an AB test on real users.

For example:

$ rake debug:rank_evaluation

harry potter:                      0.6297902553883483
passport:                          0.7926871059630242
contact:                           0.9957591943893841
...
Overall score:                     0.8209391842392532 (average of all scores)

The above means that the queries for passport and contact are both returning better results than the query for harry potter.

Moreover, if the score for harry potter was 0.2392687105963024 before we made a change, then that means we've made a good change for that query.

This can be measured over time, and it is! See the section on 'What do we do with the query scores?' below.

How do we compute a score for a query?

Given a set of queries and a list of manually rated documents, the API tells us how well the actual results for each query match the ratings we have supplied.

A score of 1 is perfect and a score of 0 is catastrophic.

But how do we get to the number 0.6297902553883483 for the query harry potter?

We use nDCG (Normalised Discounted Cumulative Gain). DCG is a measure of ranking quality.

From Wikipedia:

By using DCG we make two assumptions:

  • Highly relevant documents are more useful when appearing earlier in a search engine result list (have higher ranks)
  • Highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than non-relevant documents.
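
Concretely, DCG at rank position p discounts each document's rating by its position, and nDCG divides that by the DCG of the ideal ordering (IDCG). This is the standard formulation; Elasticsearch's dcg metric may use an exponential gain (2^rel - 1) rather than the raw rating, but the structure is the same:

\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i + 1)}
\qquad
\mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}

where rel_i is the rating of the document returned at position i, and IDCG_p is the DCG of the same documents arranged in their ideal order.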

DCG requires a query and a list of relevancy judgements (rated documents).

We manually give a rating between 0 and 3 to documents in search results:

0 = irrelevant (which is equivalent to unrated in DCG)
1 = misplaced
2 = near
3 = relevant

For example, for the query harry potter we manually rate the documents in the results:

QUERY             RATING  DOCUMENT
harry potter      2       Who is Harry Potter?
harry potter      0       Sign in to your HMRC account
harry potter      3       Harry Potter
harry potter      2       Harry Potter World
...

We then provide this to the Elasticsearch Rank Evaluation API, which behind the scenes runs the query for harry potter, compares the ratings we provided with the actual results, computes normalised DCG (a number between 0 and 1) and returns it to us. Thus harry potter = 0.6297902553883483 at this moment in time.
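
To make the calculation concrete, here is a small sketch computing nDCG for the three example results above using the standard linear-gain formulation. The real score from Elasticsearch will differ, since it evaluates the full result list (and may use a different gain function):

# Ratings of the results actually returned for "harry potter", in returned order:
# 2 = "Who is Harry Potter?", 0 = "Sign in to your HMRC account", 3 = "Harry Potter"
actual = [2, 0, 3]

# DCG: each rating is discounted by the log of its 1-based position.
def dcg(ratings)
  ratings.each_with_index.sum { |rel, i| rel / Math.log2(i + 2) }
end

ideal = actual.sort.reverse       # best possible ordering: [3, 2, 0]
ndcg  = dcg(actual) / dcg(ideal)  # 1.0 would mean the ordering was already ideal

puts ndcg.round(3)                # prints 0.821 for this toy example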

What do we do with the query scores?

We use them for checking changes locally, as a quick check before we run an AB test.

bundle exec rake relevance:ndcg

We also report the rank evaluation scores to Graphite. This enables us to plot how relevancy changes over time (overall and for a given query).

This runs every 3 hours in all environments:

SEND_TO_GRAPHITE=true bundle exec rake relevance:ndcg

See the Search Relevancy Grafana dashboard.

How do we collect relevancy judgements?

We use a combination of implicit and explicit relevance judgements.

The implicit relevance judgements are derived from user click data.

The explicit judgements are provided by experts across government via the search relevance tool: https://github.com/alphagov/govuk-search-relevance-tool.

Relevance judgements are uploaded in CSV format to an S3 bucket, which search-api then pulls when the scheduled job runs.
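
For illustration, reading such a file might look like the sketch below; the column names (query, rating, document) and the document paths are assumptions based on the example table above, not the tool's actual export format:

require "csv"

# Hypothetical judgements CSV; the column names and paths are assumptions for
# illustration, not the real export format.
judgements_csv = <<~CSV
  query,rating,document
  harry potter,3,/harry-potter
  harry potter,2,/harry-potter-world
  passport,3,/apply-renew-passport
CSV

# Group the ratings by query, ready to be turned into rank evaluation requests.
judgements = CSV.parse(judgements_csv, headers: true).map do |row|
  { query: row["query"], rating: Integer(row["rating"]), document: row["document"] }
end
by_query = judgements.group_by { |j| j[:query] }

puts by_query.keys.inspect  # => ["harry potter", "passport"]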