Elasticsearch cluster health
Elasticsearch reports cluster health as one of three possible states, based on the state of its primary and replica shards.
green
- all primary and secondary (replica) shards are allocated. There are two (or more) copies of all shards across the nodes on the cluster.
yellow
- all primary shards are allocated, but at least one replica shard is not allocated. Any shards that only exist as a primary are at risk of data loss should another node in the cluster fail.
red
- at least one primary shard is not allocated. This can happen when an entire cluster is cold started, before it’s initially allocated the primary shards. If it happens at other times this may be a sign of data loss.
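The three states above can be read as a simple decision rule. The sketch below is only an illustration of that rule, not how Elasticsearch computes the status internally; the function name and its two parameters are hypothetical.

```python
def cluster_status(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Map unallocated shard counts to the health colour described above.

    Hypothetical helper for illustration only; Elasticsearch reports the
    status itself via the /_cluster/health endpoint.
    """
    if unassigned_primaries > 0:
        # At least one primary shard is not allocated.
        return "red"
    if unassigned_replicas > 0:
        # All primaries are allocated, but at least one replica is not.
        return "yellow"
    # Every primary and replica shard is allocated.
    return "green"
```

Note that an unallocated primary always takes precedence: a cluster missing both a primary and a replica is red, not yellow.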
More comprehensive documentation on cluster health can be found in the Elasticsearch documentation.
Make sure you understand the consequences of the problem before jumping to a solution.
Icinga uses the check_elasticsearch_aws check from nagios-plugins to
monitor the health of the AWS-managed Elasticsearch cluster. This plugin uses
various endpoints of the Elasticsearch API, but also extrapolates additional
information to help you diagnose any problems.
Investigating problems
View a live dashboard
Follow the instructions to log in to the AWS Console UI.
There are tabs for ‘Cluster health’ and ‘Instance health’. The graphs in the console link to AWS CloudWatch, where historical metrics can be viewed over custom time periods.
Use the Elasticsearch API
An alternative to using the dashboard is to query the Elasticsearch health API
yourself. Start with the /_cluster/health endpoint and go from there.
The JSON response from the /_cluster/health endpoint looks like:
{
"cluster_name":"logging",
"status":"green",
"timed_out":false,
"number_of_nodes":3,
"number_of_data_nodes":3,
"active_primary_shards":225,
"active_shards":335,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":0
}
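When checking a response like the one above by hand, the fields that matter most are the status and any non-zero relocating, initializing or unassigned shard counts. A minimal sketch of a summary helper follows; the function name summarise_health is hypothetical, and it assumes the response contains the fields shown in the example above.

```python
import json


def summarise_health(body: str) -> str:
    """Return a one-line summary of a /_cluster/health JSON response.

    Hypothetical helper; assumes the fields shown in the example response.
    """
    health = json.loads(body)
    summary = (
        f"{health['cluster_name']}: {health['status']}, "
        f"{health['number_of_data_nodes']} data nodes"
    )
    # Only mention shard counters that are non-zero, since those are the
    # ones that indicate the cluster is recovering or degraded.
    counters = ("relocating_shards", "initializing_shards", "unassigned_shards")
    busy = [f"{name}={health[name]}" for name in counters if health.get(name, 0)]
    if busy:
        summary += ", " + ", ".join(busy)
    return summary
```

For the example response this returns "logging: green, 3 data nodes".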
A tunnel to Elasticsearch in a specific environment (e.g. staging) can be created using the following:
gds govuk connect ssh --environment staging search -- -N -L 9200:elasticsearch6:80
Elasticsearch will then be available at http://localhost:9200.
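With the tunnel open, the health endpoint can be queried from a script as well as from a browser or curl. The following is a minimal sketch using only the Python standard library; it assumes the tunnel is listening on http://localhost:9200 as described above, and the function names health_url and fetch_health are hypothetical.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumption: the SSH tunnel above is running and forwarding this port.
DEFAULT_BASE_URL = "http://localhost:9200"


def health_url(base_url: str = DEFAULT_BASE_URL) -> str:
    """Build the URL of the /_cluster/health endpoint for a base URL."""
    return urljoin(base_url.rstrip("/") + "/", "_cluster/health")


def fetch_health(base_url: str = DEFAULT_BASE_URL, timeout: float = 10.0) -> dict:
    """Fetch and decode the cluster health response.

    Raises urllib.error.URLError if the tunnel is not running or the
    request times out.
    """
    with urlopen(health_url(base_url), timeout=timeout) as response:
        return json.load(response)
```

Calling fetch_health()["status"] then gives the current cluster state as a string such as "green".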
Logging
Access to logs is detailed in the logging documentation.
Fixing issues with the cluster
GOV.UK has an Enterprise-level support plan with AWS for staging and production. Since we are using a managed service, AWS should be the first point of contact for fixing issues with the Elasticsearch cluster. They can be contacted by telephone, live chat or support request.
Response times are:
- General guidance: 24 hours
- System impaired: 12 hours
- Production system impaired: 4 hours
- Production system down: 1 hour
- Business-critical system down: 15 minutes
All requests are created through the AWS Console. Be sure to first assume the correct role for the environment in question.