search-api: Search Quality Metrics
We assess the quality of Search API search results with two metrics: Click Through Rate (an online metric) and Normalised Discounted Cumulative Gain (nDCG, an offline metric).
We monitor changes to search relevance over time in a Search Relevancy Dashboard. We call the metrics displayed there online metrics.
Our main online metric is Click Through Rate on the top results (top 1, 3, and 5). This isn't a great metric, since users might click on something that isn't what they were looking for. But this serves our needs in the absence of a more sophisticated way of measuring user success following a search.
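As a rough illustration of the metric, the sketch below computes Click Through Rate on the top 1, 3, and 5 results from a handful of made-up search sessions. The `SearchSession` type, the sample data, and the exact definition (share of searches with a click at or above position k) are assumptions for illustration, not how the dashboard is actually computed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SearchSession:
    """One search and the 1-based position of the result clicked, if any."""
    query: str
    clicked_position: Optional[int]  # None means no result was clicked


def click_through_rate(sessions: Sequence[SearchSession], k: int) -> float:
    """Fraction of searches with a click on one of the top k results."""
    if not sessions:
        return 0.0
    clicks = sum(
        1 for s in sessions
        if s.clicked_position is not None and s.clicked_position <= k
    )
    return clicks / len(sessions)


# Hypothetical sessions, for illustration only.
sessions = [
    SearchSession("harry potter", 1),
    SearchSession("passport", 3),
    SearchSession("contact", None),
    SearchSession("passport", 7),
]
for k in (1, 3, 5):
    print(f"CTR@{k}: {click_through_rate(sessions, k):.2f}")
```

Note that a click counts the same whether or not the user found what they wanted, which is the weakness described above.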
We also measure nDCG before and after re-ranking over time, to tell us how search is performing against relevance judgements.
Our main offline metric is nDCG. We measure this before and after re-ranking by our learning to rank model.
We use Elasticsearch's Ranking Evaluation API to assess the quality of results retrieved from Elasticsearch prior to re-ranking. The API enables us to score how well search-api ranks results for a given query.
What is rank evaluation?
Rank evaluation is a process by which we rate how well search-api ranks results for a given query.
For example, given the query Harry Potter, in an ideal situation the results would be returned in this order:
- Harry Potter
- Harry Potter World
- Who is Harry Potter?
However, the results might come back in this order:
- Who is Harry Potter?
- Sign in to your HMRC account
- Harry Potter
The first situation is good and the second is bad. But what we really want is a metric to say how bad the results are.
Rank evaluation provides us with a metric which tells us how good or bad the results are for a set of queries that we have already manually ranked.
So we have a rake task, debug:rank_evaluation, that does this.
This is useful, because we can tell how good a change is before running an AB test on real users.
$ rake debug:rank_evaluation
harry potter: 0.6297902553883483
passport: 0.7926871059630242
contact: 0.9957591943893841
...
Overall score: 0.8209391842392532 (average of all scores)
The above means that the queries for passport and contact are both returning better results than the query for harry potter.
Moreover, if the score for harry potter was lower before we made a change, then that means we've made a good change for that query.
This can be measured over time, and it is! See the section on 'What do we do with the query scores?' below.
How do we compute a score for a query?
Given a set of queries and a list of manually rated documents, the API tells us how well we are ranking the results of queries given the manual ratings we have supplied.
A score of 1 is perfect and a score of 0 is catastrophic.
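A request to the Ranking Evaluation API pairs each query with its rated documents and names a metric. The sketch below builds such a request body in Python. The index name `govuk`, the simple `match` query on `title`, and the document IDs are hypothetical placeholders; the query search-api really sends is more involved.

```python
import json

INDEX = "govuk"  # hypothetical index name


def build_rank_eval_body(judgements, k=10):
    """Build a request body for Elasticsearch's _rank_eval endpoint.

    judgements maps a query string to a list of (document_id, rating) pairs.
    """
    requests = []
    for query, rated in judgements.items():
        requests.append({
            "id": query,
            # Placeholder query, assumed for illustration only.
            "request": {"query": {"match": {"title": query}}},
            "ratings": [
                {"_index": INDEX, "_id": doc_id, "rating": rating}
                for doc_id, rating in rated
            ],
        })
    # "normalize": True asks for nDCG rather than raw DCG.
    return {"requests": requests, "metric": {"dcg": {"k": k, "normalize": True}}}


body = build_rank_eval_body({
    "harry potter": [("/harry-potter", 3), ("/who-is-harry-potter", 2)],
})
print(json.dumps(body, indent=2))
```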
But how do we get to the number 0.6297902553883483 for the query harry potter?
We use nDCG (normalised Discounted Cumulative Gain). DCG is a measure of ranking quality.
By using DCG we make two assumptions:
- Highly relevant documents are more useful when appearing earlier in a search engine result list (have higher ranks)
- Highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than non-relevant documents.
DCG requires a query and a list of relevancy judgements (rated documents).
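For reference, one common formulation of DCG over the top k results, together with its normalised form, is given below. This is the exponential-gain variant; that it matches exactly what the Rank Evaluation API computes is an assumption here. The symbol rel_i is the rating of the result at position i, and IDCG_k is the DCG of the same judgements in their ideal order. The numerator rewards higher ratings and the logarithmic denominator discounts lower positions, which mirrors the two assumptions listed above.

```latex
\mathrm{DCG}_k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{nDCG}_k = \frac{\mathrm{DCG}_k}{\mathrm{IDCG}_k}
```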
We manually give a rating between 0 and 3 to documents in search results:
- 0 = irrelevant (which is equivalent to unrated in DCG)
- 1 = misplaced
- 2 = near
- 3 = relevant
For example, for the query harry potter we manually rate the documents in the results:
QUERY         RATING  DOCUMENT
harry potter  2       Who is Harry Potter?
harry potter  0       Sign in to your HMRC account
harry potter  3       Harry Potter
harry potter  2       Harry Potter World
...
We then provide this to the Elasticsearch Rank Evaluation API, which behind the scenes runs a query for harry potter, compares the ratings we provided with the actual results, computes normalised DCG (a number between 0 and 1) and returns it to us. Thus harry potter = 0.6297902553883483 at this moment.
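To make the calculation concrete, here is a minimal Python sketch of nDCG applied to the three "bad order" results from the earlier example. It assumes the exponential-gain form of DCG and an ideal ordering built from the four judgements listed above; both are assumptions rather than a statement of what Elasticsearch does internally. It will not reproduce 0.6297902553883483, because the full list of judgements and results is elided in this document.

```python
import math
from typing import Sequence


def dcg(ratings: Sequence[int]) -> float:
    """Discounted cumulative gain for ratings listed in ranked order."""
    return sum(
        (2 ** rating - 1) / math.log2(position + 1)
        for position, rating in enumerate(ratings, start=1)
    )


def ndcg(returned: Sequence[int], judged: Sequence[int]) -> float:
    """DCG of the returned order divided by DCG of the ideal order."""
    ideal = sorted(judged, reverse=True)[: len(returned)]
    ideal_dcg = dcg(ideal)
    return dcg(returned) / ideal_dcg if ideal_dcg > 0 else 0.0


# Ratings of the results in the order they came back:
# "Who is Harry Potter?" (2), "Sign in to your HMRC account" (0), "Harry Potter" (3)
returned = [2, 0, 3]
# Every judgement held for the query, including "Harry Potter World" (2)
judged = [2, 0, 3, 2]

score = ndcg(returned, judged)
print(f"harry potter: {score:.4f}")
```

Placing the rating-3 document third and an irrelevant one second is what pulls the score well below 1.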
What do we do with the query scores?
We use them for checking changes locally, as a quick check before we run an AB test.
bundle exec rake relevance:ndcg
We also report the rank evaluation scores to Graphite. This enables us to plot how relevancy changes over time (overall and for a given query).
This runs every 3 hours in all environments:
SEND_TO_GRAPHITE=true bundle exec rake relevance:ndcg
See the Search Relevancy Grafana dashboard.
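As an illustration of what reporting per-query scores could look like, the sketch below formats scores as Graphite plaintext-protocol lines (`<path> <value> <timestamp>`). The metric prefix `search.ndcg` and the way query names are sanitised are hypothetical, and search-api may well send its metrics through a different client or protocol.

```python
import time
from typing import Dict, List, Optional


def graphite_lines(scores: Dict[str, float], prefix: str,
                   timestamp: Optional[int] = None) -> List[str]:
    """Format scores as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    ts = int(time.time()) if timestamp is None else timestamp
    lines = []
    for query, score in scores.items():
        # Metric paths are dot-separated, so replace spaces and dots in the query.
        name = query.replace(" ", "_").replace(".", "_")
        lines.append(f"{prefix}.{name} {score} {ts}")
    return lines


# Hypothetical prefix; scores taken from the rake output shown earlier.
for line in graphite_lines(
    {"harry potter": 0.6297902553883483, "passport": 0.7926871059630242},
    prefix="search.ndcg",
):
    print(line)
```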
How do we collect relevancy judgements?
We use a combination of implicit and explicit relevance judgements.
The implicit relevance judgements are derived from user click data.
The explicit judgements are provided by experts across government via the search relevance tool: https://github.com/alphagov/govuk-search-relevance-tool.
Relevance judgements are uploaded in CSV format to an S3 bucket, which then gets pulled by search-api when the scheduled job runs.
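A minimal sketch of reading such a CSV and grouping judgements by query follows. The column names `query`, `rating`, and `link`, and the sample paths, are hypothetical; the real file layout is not described here.

```python
import csv
import io
from collections import defaultdict


def load_judgements(csv_text: str):
    """Group (link, rating) pairs by query from CSV text."""
    judgements = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        judgements[row["query"]].append((row["link"], int(row["rating"])))
    return dict(judgements)


# Hypothetical sample data, for illustration only.
sample = """query,rating,link
harry potter,2,/who-is-harry-potter
harry potter,0,/sign-in-hmrc
harry potter,3,/harry-potter
"""

for query, rated in load_judgements(sample).items():
    print(query, rated)
```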