Warning: this document has not been updated for a while and may be out of date.

Last updated: 9 Jun 2020

search-api: Decision record: Transition Whitehall documents to a Publishing API-derived search index

Date: 2017-12-21

Definitions

Throughout this document, format refers to the Rummager field format, not the document type provided by Publishing API (content_store_document_type).

Every content_store_document_type is mapped to a single format.
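The many-to-one relationship can be sketched as a lookup table. The document types and formats below are illustrative assumptions, not Rummager's actual configuration, and `format_for` is a hypothetical helper name.

```python
# Hypothetical mapping: these entries are illustrative only and are not
# taken from Rummager's real configuration.
FORMAT_BY_DOCUMENT_TYPE = {
    "detailed_guide": "detailed_guidance",
    "press_release": "news_article",
    "news_story": "news_article",
}


def format_for(document_type: str) -> str:
    """Return the single format for a document type.

    Raises KeyError if the document type has no mapping.
    """
    return FORMAT_BY_DOCUMENT_TYPE[document_type]
```

Several document types may share a format, but no document type has more than one.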

Context

Publishing applications used to communicate directly with Rummager to index documents, using the /documents HTTP API.

We've removed this integration for all publishing apps except Whitehall Publisher. The Publishing API now notifies Rummager of every update through a RabbitMQ message queue. Rummager responds to these messages by updating the govuk index. This has allowed us to remove the old mainstream index, which contained all sorts of documents beyond what we normally think of as "mainstream" content. The govuk index can be rebuilt from scratch by resending the content from the Publishing API.
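To illustrate why a message-driven index can be rebuilt by resending content, here is a minimal sketch that models the index as an in-memory dictionary. The message fields (`base_path`, `title`, `document_type`) and the function names are assumptions made for the example. It does not use RabbitMQ or reflect Rummager's real message handling.

```python
from typing import Dict, Iterable

# A toy stand-in for the govuk index: base path -> indexed fields.
Index = Dict[str, dict]


def handle_message(index: Index, message: dict) -> None:
    """Upsert the document described by one (hypothetical) update message."""
    index[message["base_path"]] = {
        "title": message["title"],
        "document_type": message["document_type"],
    }


def rebuild(messages: Iterable[dict]) -> Index:
    """Build an index from scratch by replaying every message in order."""
    index: Index = {}
    for message in messages:
        handle_message(index, message)
    return index
```

Because each message overwrites the entry for its base path, replaying the same messages again produces the same index. That property is what makes resending content a safe way to rebuild.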

See ADR 006 for more details about moving mainstream content to the govuk index.

This leaves two indexes that are populated the old way: government and detailed.

detailed contains detailed guides, and government contains everything else published by Whitehall Publisher.

We also have a separate worker (publishing-queue-listener) that listens to *.links notifications from the Publishing API and updates the old indexes (but doesn't add any new content).

Decision

We intend to get rid of the government and detailed indexes. Rummager should index Whitehall Publisher documents into the govuk index.

All search indexing code should be removed from Whitehall Publisher.

The old indexing code should be removed from Rummager:

Anything in lib/indexer
The publishing-queue-listener worker

The infrastructure to log indexing requests to disk can also be removed, since data from the Publishing API can be resent at any time.

Consequences

There are over 200,000 documents published by Whitehall Publisher.

There is also a lot of information about content that isn't currently available to Rummager. For example:

The state of a document, based on archiving policy and content audits
Publishing history details besides the public_updated_at value
Who the content is applicable to or where it is relevant
What services or task lists the content belongs to
Entities mentioned in the text of the page (or its attachments), like people, organisations, places, deadlines and forms

When all documents are indexed from the Publishing API, there will be a single indexing process that we can change easily, so we will be able to make better use of the available data to improve search.

This work is also a dependency for running Rummager in the draft stack, to support preview behaviour and draft taxonomies.

Combining all of the indexes into a single index will change the document frequency statistics, which affects how search results are scored.
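As a rough worked example of this effect, the sketch below computes an inverse document frequency (IDF) weight for one term before and after merging two indexes. It assumes a BM25-style IDF formula. The formula actually in use depends on the search engine version and its similarity settings, and the document counts are invented for illustration.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """BM25-style inverse document frequency (an assumed formula)."""
    return math.log(
        1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5)
    )


# Invented figures: a term that is fairly rare in one index
# but common in the other.
govuk_only = idf(total_docs=100_000, docs_with_term=5_000)
combined = idf(total_docs=300_000, docs_with_term=45_000)

print(f"IDF in the govuk index alone: {govuk_only:.3f}")
print(f"IDF after combining indexes:  {combined:.3f}")
```

With these figures the term appears in a larger share of the combined index, so its weight drops. Matches on that term then contribute less to a document's score than before, even though the documents themselves have not changed.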

Action plan

Copy over documents as-is and test

Before making any changes to Whitehall Publisher, we should copy all documents from the government and detailed indexes to the govuk index.

These documents will be filtered out of search results, but they will still affect the scoring of other documents indirectly, because they contribute to the index's document frequency statistics.
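One way to picture this filtering is an Elasticsearch-style bool query that matches on the search text and excludes a set of formats. This is a sketch under assumptions: the `title` field, the example format names and the `build_search_query` helper are hypothetical, and Rummager's real query is considerably more involved.

```python
from typing import Iterable


def build_search_query(search_text: str, hidden_formats: Iterable[str]) -> dict:
    """Build a query body that hides documents with the given formats.

    Hidden documents never appear in results, but they remain in the index
    and so still count towards its term statistics.
    """
    return {
        "query": {
            "bool": {
                # Scored clause: assumes a hypothetical "title" field.
                "must": [{"match": {"title": search_text}}],
                # Exclusion clause: removes matching documents from results.
                "must_not": [{"terms": {"format": list(hidden_formats)}}],
            }
        }
    }
```

For example, `build_search_query("passport", ["publication", "speech"])` would return passport-related results from every format except the two listed.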

We can then run the search healthcheck to understand how this affects the ranking of results. At this stage we may want to re-evaluate how much each format gets boosted in the search query.

Once this is done, the govuk indexing process can be extended to Whitehall Publisher formats.

Follow the process to move a document type to the new indexing process

See new indexing process

This process needs to be repeated for each document type or group of related document types.

Delete the old indexes

Once every document type is done, check for missing documents and then remove the old indexes and indexing code.

See example script for checking all documents are indexed correctly.
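The core of such a check is a set comparison between the identifiers in the old indexes and those in the govuk index. The sketch below is not the example script referred to above. The `missing_documents` helper and the use of base paths as identifiers are assumptions for illustration.

```python
from typing import Iterable, List


def missing_documents(old_ids: Iterable[str], new_ids: Iterable[str]) -> List[str]:
    """Return identifiers present in the old indexes but absent from the new one."""
    return sorted(set(old_ids) - set(new_ids))


# Invented identifiers, standing in for the results of listing each index.
old_index_ids = ["/government/news/a", "/government/news/b", "/guidance/c"]
govuk_index_ids = ["/government/news/a", "/guidance/c", "/browse/d"]

for identifier in missing_documents(old_index_ids, govuk_index_ids):
    print(f"Missing from govuk index: {identifier}")
```

An empty result suggests it is safe to move on to removing the old indexes. Any identifiers listed would need investigating first.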