# search-api: Decision record: Upgrade to Elasticsearch 6
Date: 2019-10-03
- High-level migration plan
- New cluster architecture
- Elasticsearch 6 compatibility
- Connecting to multiple Elasticsearch clusters
- Synchronising the state
- Data sync for production -> staging -> integration
- A/B testing Elasticsearch 6
- Decommissioning Elasticsearch 5
This work was done over 2019/20 Q1 and Q2. Most of the changes were made in this repository. There were also significant changes in govuk-aws, govuk-aws-data, and govuk-puppet.
## High-level migration plan
Based on our experiences migrating to Elasticsearch 5 we decided that this time we wanted to isolate the complexity of having two Elasticsearch clusters to search-api:
- Make search-api work with both Elasticsearch 5 and Elasticsearch 6.
- Give search-api support for multiple Elasticsearch clusters:
- Index changes into all clusters.
- Add a URL parameter to select the cluster to use for a query.
- Perform a one-time import of data from Elasticsearch 5 to Elasticsearch 6, and synchronise any updates missed while this import was running.
- Confirm that the Elasticsearch 5 and Elasticsearch 6 clusters are consistent.
- Set up an A/B test for finder-frontend (etc) to select which Elasticsearch cluster to use.
- Gradually phase from 100% Elasticsearch 5 to 100% Elasticsearch 6.
- Retire Elasticsearch 5.
The approach we took last time of having both rummager and search-api running in parallel, necessitated by one being in Carrenza and one in AWS, led to a lot of changes to govuk-puppet, and didn't really give us anything reusable. By adding multi-cluster support to search-api directly, we do get something reusable.
## New cluster architecture
Based on our experience of running Elasticsearch 5 for a quarter, we at first opted for:
- Three dedicated master nodes, of type `c4.large.elasticsearch`.
- Six data nodes, of type `r4.large.elasticsearch`.
We found the data nodes were not powerful enough, and switched to `r4.xlarge.elasticsearch`. After talking to an AWS Solutions Architect, we changed again to `r5.xlarge.elasticsearch`. Due to wider GOV.UK scaling concerns caused by political activity, we increased the data nodes to `r5.2xlarge.elasticsearch`.

We also shrank the cluster in staging (to below production size), as it does not need to cope with the indexing load. The final cluster sizes are:
Environment | Master | Data
---|---|---
Production | 3x `c4.large.elasticsearch` | 6x `r5.2xlarge.elasticsearch`
Staging | 3x `c4.large.elasticsearch` | 3x `r5.2xlarge.elasticsearch`
Integration | 3x `t2.medium.elasticsearch` | 3x `r5.large.elasticsearch`
The cluster is configured in the `app-elasticsearch6` Terraform project.
## Elasticsearch 6 compatibility
We used deprecation messages from Elasticsearch 5 and error messages from Elasticsearch 6 to figure out what needed to change. There were some changes needed to the indices and some to the queries.
Index changes were:
- Switching from `string` fields to `text` and `keyword` fields (PR #1553).
- Removing the use of the `_all` field and `include_in_all` (PR #1557).
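
For illustration, here is roughly what the first of those changes looks like at the mapping level. This is a sketch with made-up field names, not an excerpt from our schema:

```python
# Sketch only: how an Elasticsearch 5 "string" field splits into the
# Elasticsearch 6 "text" and "keyword" types. Field names are illustrative.

# Elasticsearch 5: one "string" type; "index": "not_analyzed" gave exact matching.
es5_mapping = {
    "properties": {
        "title": {"type": "string"},
        "format": {"type": "string", "index": "not_analyzed"},
    }
}

# Elasticsearch 6: "text" for analysed full-text search, "keyword" for exact values.
es6_mapping = {
    "properties": {
        "title": {"type": "text"},
        "format": {"type": "keyword"},
    }
}
```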
Query changes were:
- Using `like` instead of `docs` in the `more_like_this` query (PR #1561).
- Replacing `match: { type: phrase }` with `match_phrase` queries (PR #1564).
- Replacing the `indices` query with a `should` query (PR #1568).
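
As an example of the `match_phrase` change, here is the same phrase query in both forms, expressed as Python dicts (the field name and query text are illustrative):

```python
# Elasticsearch 5: phrase matching was an option on the "match" query.
es5_query = {"match": {"title": {"query": "driving licence", "type": "phrase"}}}

# Elasticsearch 6: phrase matching is the dedicated "match_phrase" query.
es6_query = {"match_phrase": {"title": "driving licence"}}
```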
The `indices`/`should` change introduced an inefficiency: query generation now needs to ask Elasticsearch for the real name of an alias. We added a new `build_query` metric to keep track of this. Elasticsearch issue #23306, opened in February 2017, is about a solution to this problem.
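
The extra round trip looks something like the following sketch (shown with the Python client for brevity; search-api does the equivalent in Ruby):

```python
# Sketch of the alias lookup the indices -> should change forces on us:
# before building the query, resolve an alias (e.g. "govuk") to the
# concrete index it currently points at.
from elasticsearch6 import Elasticsearch

es = Elasticsearch(["http://elasticsearch6:80"])

def real_index_name(alias):
    # get_alias returns a dict keyed by concrete index name, for example
    # {"govuk-2019-10-01t12-00-00": {"aliases": {"govuk": {}}}}.
    return next(iter(es.indices.get_alias(name=alias)))
```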
We also changed the text similarity metric from classic similarity to the new BM25 similarity.
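
In index-settings terms the similarity change is small. A sketch, assuming the default similarity is set at the index level:

```python
# What we moved away from: forcing the classic TF/IDF similarity.
classic_settings = {
    "settings": {"index": {"similarity": {"default": {"type": "classic"}}}}
}

# What we moved to: BM25, the modern Elasticsearch default.
bm25_settings = {
    "settings": {"index": {"similarity": {"default": {"type": "BM25"}}}}
}
```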
## Connecting to multiple Elasticsearch clusters
The `elasticsearch.yml` file contains a list of clusters. These clusters can specify their own schema configuration file, which means we can try out different index-level settings in a new cluster (for example, changing the text similarity metric only for Elasticsearch 6).
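
A minimal sketch of the shape that file might take (the keys and paths here are illustrative, not the real configuration; the URIs match the internal hostnames used elsewhere in this document):

```yaml
# Illustrative only: two clusters, each able to point at its own schema.
clusters:
  - key: "A"
    uri: "http://elasticsearch5:80"
    schema_config_path: "config/schema"
  - key: "B"
    uri: "http://elasticsearch6:80"
    schema_config_path: "config/schema-es6"
```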
In the code, the multi-cluster support is implemented in the `SearchConfig` and `Index::Client` classes. Cluster selection is implemented in the `SearchParameterParser` class.
This involved some significant refactoring (PR #1569 and PR #1604).
The main architectural decisions made here were:
- To index writes to all clusters, ensuring cluster consistency.
- To perform queries against the default cluster unless a URL parameter is given, so we can A/B test clusters.
- To have one `SearchConfig` singleton per cluster.
This change did result in details of clusters leaking throughout search-api, which is unfortunate, but that can be addressed in future refactoring work (for example, passing around `SearchConfig` instances rather than asking for cluster singletons).
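
Taken together, the first two decisions amount to a pattern like this sketch (Python with illustrative names, rather than search-api's actual Ruby):

```python
# Sketch of the multi-cluster pattern: writes fan out to every cluster,
# reads go to the default cluster unless one is selected explicitly.
from elasticsearch import Elasticsearch

CLUSTERS = {
    "A": Elasticsearch(["http://elasticsearch5:80"]),
    "B": Elasticsearch(["http://elasticsearch6:80"]),
}
DEFAULT_CLUSTER = "A"

def index_document(index, doc_id, body):
    # Indexing into all clusters is what keeps them consistent.
    for client in CLUSTERS.values():
        client.index(index=index, doc_type="generic-document", id=doc_id, body=body)

def search(index, body, cluster=None):
    # Queries go to one cluster; the optional argument is what the
    # cluster-selection URL parameter maps onto.
    client = CLUSTERS[cluster or DEFAULT_CLUSTER]
    return client.search(index=index, body=body)
```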
## Synchronising the state
We planned to either take a snapshot from Elasticsearch 5 and restore it to Elasticsearch 6, or to use the same approach as the last upgrade (a script to copy the data across), but these turned out to be unnecessary.
We had search-api running with both clusters for a couple of weeks before we started to think about synchronising the data, and by that time everything had been republished (either directly or as a result of dependency resolution) and so the `govuk`, `government`, and `detailed` indices were consistent with Elasticsearch 5.
The `page-traffic` index was also handled in the multi-cluster work, with traffic data being saved to all clusters.
The only index which needed some manual work was `metasearch`, which holds best bets. Attempting to republish these from search-admin kept failing due to transient network issues, and we could never get all of the best bets across that way. So for the `metasearch` index we used this script:
```python
from datetime import datetime
import os

from elasticsearch5 import Elasticsearch as Elasticsearch5, TransportError as TransportError5
from elasticsearch6 import Elasticsearch as Elasticsearch6, TransportError as TransportError6
from elasticsearch6.helpers import bulk

INDEX = 'metasearch'
GENERIC_DOC_TYPE = 'generic-document'

ES5_HOST = os.getenv('ES5_ORIGIN_HOST', 'http://elasticsearch5:80')
ES6_HOST = os.getenv('ES6_TARGET_HOST', 'http://elasticsearch6:80')

es_client5 = Elasticsearch5([ES5_HOST])
es_client6 = Elasticsearch6([ES6_HOST])


def _prepare_docs_for_bulk_insert(docs):
    # Strip the ES5 search-hit wrapper down to the fields bulk() needs.
    for doc in docs:
        yield {
            "_id": doc['_id'],
            "_source": doc['_source'],
        }


def bulk_index_documents_to_es6(documents):
    try:
        bulk(
            es_client6,
            _prepare_docs_for_bulk_insert(documents),
            index=INDEX,
            doc_type=GENERIC_DOC_TYPE,
            chunk_size=100,
        )
    except TransportError6 as e:
        print("Failed to index documents: {}".format(e))


def fetch_documents(page_size=100, scroll_id=None):
    # The first call opens a scroll over the whole index; later calls just
    # pull the next page from that scroll.
    try:
        if scroll_id is None:
            results = es_client5.search(index=INDEX, doc_type=GENERIC_DOC_TYPE,
                                        size=page_size, scroll='2m')
            scroll_id = results['_scroll_id']
        else:
            results = es_client5.scroll(scroll_id=scroll_id, scroll='2m')
        return (scroll_id, results['hits']['hits'])
    except TransportError5 as e:
        print("Failed to fetch documents: {}".format(e))
        raise


if __name__ == '__main__':
    start = datetime.now()
    dcount = es_client5.count(index=INDEX, doc_type=GENERIC_DOC_TYPE)['count']
    print('Preparing to index {} document(s) from ES5'.format(dcount))
    offset = 0
    page_size = 250
    scroll_id = None
    while offset <= dcount:
        scroll_id, docs = fetch_documents(page_size=page_size, scroll_id=scroll_id)
        print('Indexing documents {} to {} into ES6'.format(offset, offset + page_size))
        bulk_index_documents_to_es6(docs)
        offset += page_size
    print('Finished in {}'.format(datetime.now() - start))
```
## Data sync for production -> staging -> integration
This was done in the same way as for Elasticsearch 5.
The data sync script needed to be modified to allow for a different host to be used (`http://elasticsearch6` vs `http://elasticsearch5`), but otherwise this was just a matter of writing some more configuration.
## A/B testing Elasticsearch 6
The A/B test was set up like so:
- Configuration in govuk-cdn-config.
- Logic in finder-frontend to pass one of two URL parameters to search-api, based on the CDN-level A/B test.
- Logic in search-api to choose a cluster based on the parameter from finder-frontend.
The general A/B test process is covered in the dev docs.
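
End to end, the hand-off looks something like this sketch. The parameter name and URL are hypothetical; the real names live in finder-frontend and search-api:

```python
# Hypothetical sketch of finder-frontend's side of the hand-off: the CDN
# assigns a variant, and the variant decides whether a cluster-selection
# parameter is passed to search-api. "use_cluster" and the host are
# invented names, used only to illustrate the flow.
import requests

SEARCH_API = "http://search-api.example"  # illustrative placeholder

def search(query, variant):
    params = {"q": query}
    if variant == "B":
        params["use_cluster"] = "B"  # hypothetical cluster-selection parameter
    return requests.get(SEARCH_API + "/search.json", params=params).json()
```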
For the A/B test we monitored click-through rate, proportion of search refinements, and proportion of search exits. The test revealed a significant degradation in our metrics compared with Elasticsearch 5, and we were unable to proceed with the switch without doing work to improve the search results.
The two main changes in Elasticsearch 6 which impacted search result quality were:
- Switching from classic similarity to BM25 similarity. This issue arose with Elasticsearch 5, but we decided to go ahead with the new similarity when moving to Elasticsearch 6.
- The removal of query coordination factors, affecting the scoring of multi-clause `should` and `must` queries. This meant that even if we switched back to classic similarity, we wouldn't get the same results we had with Elasticsearch 5.
With query coordination factors, multi-clause `should` and `must` queries are scored as:

```
sum(clause scores) * num(matching clauses) / num(clauses)
```
So if a query has 7 clauses and 2 of them match, the overall score is multiplied by 2/7. The effect of this is to make documents which match multiple clauses tend to rank higher than documents which match fewer clauses, even if those fewer clauses are matched really well. The assumption is that the number of matching clauses is an important predictor of relevance.
Without query coordination factors, the query is scored as:

```
sum(clause scores)
```
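
To make the difference concrete, here is the example above (7 clauses, 2 matching) worked through in a few lines of Python:

```python
# Coordination: sum the matching clause scores, then scale by m/n.
def coord_score(clause_scores, total_clauses):
    matching = [s for s in clause_scores if s > 0]
    return sum(matching) * len(matching) / total_clauses

# Without coordination (Elasticsearch 6): just the sum.
def no_coord_score(clause_scores):
    return sum(s for s in clause_scores if s > 0)

scores = [4.0, 3.0]             # the two matching clauses out of seven
print(coord_score(scores, 7))   # 7.0 * 2/7 = 2.0
print(no_coord_score(scores))   # 7.0
```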
Figuring out how to improve the search query was not a straightforward, or particularly systematic, process.
Our Elasticsearch 5 query is:
```ruby
{
  bool: {
    should: [
      match_phrase("title", query),
      match_phrase("acronym", query),
      match_phrase("description", query),
      match_phrase("indexable_content", query),
      match_all_terms(%w(title acronym description indexable_content), query),
      match_any_terms(%w(title acronym description indexable_content), query),
      minimum_should_match("all_searchable_text", query)
    ],
  }
}
```
And the Elasticsearch 6 query we settled on is:
```ruby
should_coord_query([
  match_all_terms(%w(title), query, MATCH_ALL_TITLE_BOOST),
  match_all_terms(%w(acronym), query, MATCH_ALL_ACRONYM_BOOST),
  match_all_terms(%w(description), query, MATCH_ALL_DESCRIPTION_BOOST),
  match_all_terms(%w(indexable_content), query, MATCH_ALL_INDEXABLE_CONTENT_BOOST),
  match_all_terms(%w(title acronym description indexable_content), query, MATCH_ALL_MULTI_BOOST),
  match_any_terms(%w(title acronym description indexable_content), query, MATCH_ANY_MULTI_BOOST),
  minimum_should_match("all_searchable_text", query, MATCH_MINIMUM_BOOST)
])
```
Here `should_coord_query` is a reimplementation of the query coordination factor-based scoring, using a `function_score` query. We also changed the `match_phrase` clauses in the Elasticsearch 5 query to `match_all_terms` clauses, and adjusted the field boosting factors.
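
We haven't reproduced the real `should_coord_query` here, but one way to emulate coordination with a `function_score` query is sketched below, with clauses as plain query dicts. This is an illustration of the idea, not search-api's implementation:

```python
# Sketch: score = sum(clause scores) * (matching clauses / total clauses).
# The bool/should part supplies the summed clause scores; each function
# fires only when its clause matches, so score_mode "sum" yields m/n, and
# boost_mode "multiply" applies that factor to the query score.
def should_coord_query(clauses):
    n = len(clauses)
    return {
        "function_score": {
            "query": {"bool": {"should": clauses}},
            "functions": [
                {"filter": clause, "weight": 1.0 / n} for clause in clauses
            ],
            "score_mode": "sum",
            "boost_mode": "multiply",
        }
    }
```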
We then ran the A/B test again, and found that Elasticsearch 6 with the new query had a click-through rate within 3 percentage points of Elasticsearch 5 with the old query. We decided that this was good enough to go ahead with the switch.
## Decommissioning Elasticsearch 5
The steps we followed to switch to Elasticsearch 6 permanently and to decommission Elasticsearch 5 were:
- Switch search-api to the B variant of the A/B test and disable the A cluster so no more indexing is done (PR #1713).
- Disable the A/B test in govuk-cdn-config (PR #203) and finder-frontend (PR #1611).
- Remove Elasticsearch 5 configuration from govuk-puppet (PR #9643).
- Remove Elasticsearch 5 from govuk-aws (PR #1123).
We didn't want to be permanently using the "B" configuration, and changing this required some care:
- Set the `ELASTICSEARCH_URI` environment variable to the same value as the `ELASTICSEARCH_B_URI` environment variable in govuk-puppet (PR #9648).
- Swap out the "B" cluster configuration for the "A" cluster configuration (PR #1718).
- Unset the `ELASTICSEARCH_B_URI` environment variable in govuk-puppet (PR #9649).
This approach avoided the need to coordinate simultaneous deploys of search-api and govuk-puppet.