Warning This document has not been updated for a while now. It may be out of date.

Last updated: 9 Jun 2020

search-api: Decision record: Transition mainstream formats to a Publishing API derived search index

Date: 2017-08-24

Definitions

Throughout this document, format refers to the Rummager field format, not the document type provided by Publishing API (content_store_document_type).

Every content_store_document_type is mapped to a single format.

mainstream is used as an example of a search index we want to retire. This document should also apply to the government and detailed indexes.

Context

We're building a new search index using Publishing API data

We're currently replacing the existing search indexes (mainstream, government, detailed) with a single index (govuk), which is derived from Publishing API.

The existing indexes are populated from at least 10 different applications.

This change should improve the quality of the data our search system relies on:

all formats are handled in a consistent manner
we can update documents after any publishing event (publishing, withdrawing, unpublishing, tagging)
it should be easier to rebuild if we need to revert to an old backup

Lifecycle of our search indexes

Rummager's search indexes are long lived, but content is regularly updated to derive the popularity field from recent pageviews. As of ADR 003: Popularity updating without index locks, this uses updates, rather than rebuilding the search index every time.

There is a separate task in place to reindex the content currently in the search indexes with zero downtime. This is something we need to do after adding new fields or otherwise changing the Elasticsearch mappings, for those fields to work properly.

Problems we're not addressing now

Neither of the above mechanisms adds or removes documents from the search index, which means that:

if an edition is ever published, and the search index doesn't get updated, it stays out of date until the next time the document is updated
if an edition is unpublished without Rummager being notified, it stays in the search index forever

Long term, we'd like to be able to easily rebuild the whole govuk index from scratch, to avoid these kinds of problems, but we aren't aiming to do this right now.

Immediate needs

We need to be able to populate the govuk index with all the documents that already exist, so we can start using this index in the Search API.

We also want to be able to reindex everything that was published within a short, recent period (the last 1 or 2 days), so that we can easily bring the index up to date after restoring it from a backup.

Moving formats one at a time

We're intending to switch on the new index format-by-format, by querying both old and new indexes, and using filters to select which index gets used for each format.

This means we can retire old search indexing code in publishing apps, and search indexes that are no longer needed, without having to populate everything all at once.

We can also revert back to the old index with a simple configuration change if something goes wrong.
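
The per-format switch could look something like the following sketch. It is illustrative only: MIGRATED_FORMATS, OLD_INDEXES, build_index_filter and matches are hypothetical names, not Rummager's actual configuration or code, and "help_page" is just an example value. The dictionary mimics an Elasticsearch bool filter that matches on the _index metadata field and the format field.

```python
# Hypothetical configuration: formats that are served from the new index.
MIGRATED_FORMATS = ["help_page"]  # example value only
OLD_INDEXES = ["mainstream", "government", "detailed"]
NEW_INDEX = "govuk"


def build_index_filter(migrated_formats):
    """Return an Elasticsearch-style bool filter that picks one index per format."""
    formats = sorted(migrated_formats)
    # Migrated formats are only accepted from the new index.
    from_new_index = {
        "bool": {
            "must": [
                {"term": {"_index": NEW_INDEX}},
                {"terms": {"format": formats}},
            ]
        }
    }
    # Everything else is only accepted from the old indexes.
    from_old_indexes = {
        "bool": {
            "must": [{"terms": {"_index": OLD_INDEXES}}],
            "must_not": [{"terms": {"format": formats}}],
        }
    }
    # A document passes the filter if it satisfies either branch.
    return {
        "bool": {
            "should": [from_new_index, from_old_indexes],
            "minimum_should_match": 1,
        }
    }


def matches(doc_index, doc_format, migrated_formats):
    """Pure-Python equivalent of the filter, handy for checking the logic."""
    if doc_format in migrated_formats:
        return doc_index == NEW_INDEX
    return doc_index in OLD_INDEXES
```

Under this sketch, reverting a format to the old index means removing it from the configured list; nothing needs to be deleted from either index.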

A problem we've discovered with this approach is that the relevancy of a document within an index depends on the other documents in that index. If an index is only partially populated, the TF-IDF statistics are less representative, and this affects search results.
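
As a rough illustration of this effect, the sketch below computes a textbook inverse document frequency, ln(N / df), for the same term in a fully populated index and in a partially populated one. The document counts are made up, and Elasticsearch's actual scoring formula differs in detail; the point is only that a term's weight depends on what else is in the index.

```python
import math


def idf(total_docs, docs_containing_term):
    """Textbook inverse document frequency: ln(N / df)."""
    return math.log(total_docs / docs_containing_term)


# Made-up numbers: a term appears in 100 documents of one format.
# In a fully populated index those are 100 of 10,000 documents...
full_index_idf = idf(10_000, 100)
# ...but if only that format has been indexed so far, they are 100 of 200.
partial_index_idf = idf(200, 100)

print(f"full index:    {full_index_idf:.2f}")
print(f"partial index: {partial_index_idf:.2f}")
```

With these numbers the term looks much less distinctive in the partial index, so documents matching it would be weighted differently than they would be once the index is complete.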

Decision

Firstly, we'll implement a task to bulk-reindex chunks of content from Publishing API. Rummager will process this content in the same way as regular publishing updates.

Reindexing by format lets us initially populate the new index.
Reindexing by date range lets us bring the govuk index up to date when restoring from a backup.
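
A minimal sketch of these two selection modes, assuming each Publishing API document exposes a format and a last-updated timestamp. The Document type, its field names and select_for_reindex are hypothetical; they only illustrate the selection logic, not the real task.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator, Optional


@dataclass
class Document:
    # Hypothetical shape of a record fetched from Publishing API.
    content_id: str
    format: str
    updated_at: datetime


def select_for_reindex(
    documents: Iterable[Document],
    format: Optional[str] = None,
    updated_from: Optional[datetime] = None,
    updated_to: Optional[datetime] = None,
) -> Iterator[Document]:
    """Yield the documents that a bulk reindex should resend to the indexer."""
    for doc in documents:
        if format is not None and doc.format != format:
            continue
        if updated_from is not None and doc.updated_at < updated_from:
            continue
        if updated_to is not None and doc.updated_at > updated_to:
            continue
        yield doc


docs = [
    Document("a", "help_page", datetime(2017, 8, 1)),
    Document("b", "help_page", datetime(2017, 8, 23)),
    Document("c", "edition", datetime(2017, 8, 23)),
]

# Initial population of one format.
by_format = [d.content_id for d in select_for_reindex(docs, format="help_page")]
# Catching up after restoring a backup taken on 22 August.
by_date = [
    d.content_id
    for d in select_for_reindex(docs, updated_from=datetime(2017, 8, 22))
]
```

Because the selected documents go through the same processing as regular publishing updates, the task itself only has to decide which documents to resend.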

Secondly, when we change the indexing process for a format to use the new index, the format will go through the following phases:

Phase 1: Untransitioned

At search time, Rummager reads untransitioned formats from the old indexes. Documents belonging to untransitioned formats that are stored in the govuk index will be filtered out.

At index time, Rummager will ignore Publishing API messages affecting untransitioned formats.

The nightly update job will update the popularity field of untransitioned formats in the mainstream index. It will also copy untransitioned format documents from the mainstream index into the govuk index (it doesn't matter in which order these two things happen).

Net effect: untransitioned formats are considered in the TF-IDF statistics of transitioned formats, but are not ready to be returned from the govuk index themselves.

Phase 2: Indexed

At search time, the behaviour is the same as an untransitioned format.

At index time, Rummager will insert documents into the govuk index.

The nightly update will update the popularity field in the govuk index.

As a one-off task, we'll delete all existing data for the format from the govuk index, and reindex it.

Net effect: the data in the govuk index comes from Publishing API data for indexed formats.

Phase 3: Transitioned

At search time, Rummager reads transitioned formats from the govuk index. Documents belonging to transitioned formats that are stored in the mainstream index will be filtered out.

At index time, Rummager will insert documents into the govuk index.

The nightly update will update the popularity field in the govuk index.

Net effect: the search API uses Publishing API-derived data for transitioned formats.
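
The three phases can be summarised as a small lookup table. This is a sketch of the rules described above rather than Rummager's code: the Phase enum, the Behaviour tuple and its field names are hypothetical, and "mainstream" stands in for whichever old index the format currently lives in.

```python
from enum import Enum
from typing import NamedTuple


class Phase(Enum):
    UNTRANSITIONED = 1
    INDEXED = 2
    TRANSITIONED = 3


class Behaviour(NamedTuple):
    search_index: str               # index that search results are read from
    index_publishing_updates: bool  # whether Publishing API messages are written to govuk
    popularity_index: str           # index whose popularity field the nightly job updates
    copy_from_old_index: bool       # whether the nightly job copies old-index documents into govuk


BEHAVIOUR = {
    Phase.UNTRANSITIONED: Behaviour("mainstream", False, "mainstream", True),
    Phase.INDEXED: Behaviour("mainstream", True, "govuk", False),
    Phase.TRANSITIONED: Behaviour("govuk", True, "govuk", False),
}
```

Laid out this way, the only difference between the indexed and transitioned phases is which index search reads from, which is what makes the final switch a configuration change.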

Consequences

Copying mainstream data into govuk adds a layer of complexity to the nightly popularity updater that we won't be able to get rid of until we've retired that index.

Since we've decided not to implement the ability to generate indexes from scratch now, any content that is not removed from search when it is unpublished will stay in the index indefinitely.

We will continue to address this by manually removing content in search admin.