This page describes what to do in case of an Icinga alert. For more information, you can search the govuk-puppet repository for the source of the alert.

Icinga alerts

Elasticsearch cluster health

Elasticsearch reports cluster health as one of three possible states, based on the state of its primary and replica shards.

  • green - all primary and secondary (replica) shards are allocated. There are two (or more) copies of all shards across the nodes on the cluster.
  • yellow - all primary shards are allocated, but at least one replica shard is not allocated. Any shards that only exist as a primary are at risk of data loss should another node in the cluster fail.
  • red - at least one primary shard is not allocated. This can happen when an entire cluster is cold started, before the primary shards have been allocated for the first time. If it happens at any other time, it may be a sign of data loss.
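If the cluster is yellow or red, it can help to see which shards are unassigned and why. The commands below are a minimal sketch, assuming you already have a tunnel to the cluster on localhost:9200 (see 'Use the Elasticsearch API' below):

# List shards that are not allocated to any node, with the reason
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

# Ask Elasticsearch to explain why the first unassigned shard cannot be allocated
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'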

More comprehensive documentation on cluster health can be found in the Elasticsearch documentation.

Make sure you understand the consequences of the problem before jumping to a solution.

Icinga uses the check_elasticsearch_aws check from nagios-plugins to monitor the health of the AWS managed Elasticsearch cluster. This plugin uses various endpoints of the Elasticsearch API, but also extrapolates additional information to help you diagnose any problems.

Investigating problems

View a live dashboard

The AWS Console provides a dashboard with a graphical UI showing the state of the cluster and of individual instances. After logging into the console, you must assume a role for the relevant environment to see the cluster (referred to as a ‘domain’ by AWS).

There are tabs for ‘Cluster health’ and ‘Instance health’. The graphs in the console link to AWS CloudWatch, where historic metrics can be viewed over custom time periods.
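The same CloudWatch metrics can also be fetched from the command line. This is a minimal sketch, assuming the AWS CLI is configured with the assumed role for the environment, and that the domain is named "logging" (as suggested by the cluster_name in the example response below); check the console for the real domain name and substitute your AWS account ID:

# Did the cluster report red at any point in the last hour? (GNU date syntax)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ES \
  --metric-name ClusterStatus.red \
  --dimensions Name=DomainName,Value=logging Name=ClientId,Value=<aws-account-id> \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 300 \
  --statistics Maximum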

Use the Elasticsearch API

An alternative to using the dashboard is accessing the Elasticsearch health API yourself. Start with the /_cluster/health endpoint and go from there.

Response JSON from the /_cluster/health endpoint looks like:

{
  "cluster_name":"logging",
  "status":"green",
  "timed_out":false,
  "number_of_nodes":3,
  "number_of_data_nodes":3,
  "active_primary_shards":225,
  "active_shards":335,
  "relocating_shards":0,
  "initializing_shards":0,
  "unassigned_shards":0
}

A tunnel to Elasticsearch in a specific environment (e.g. staging) can be created using the following command:

ssh -At jumpbox.staging.govuk.digital -L 9200:localhost:9200 "ssh -q \`govuk_node_list --single-node -c search\` -L 9200:elasticsearch5.blue.staging.govuk-internal.digital:80"

Elasticsearch will then be available at http://localhost:9200.
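For example, to check the overall cluster health and spot problem indices through the tunnel (a minimal sketch; the health filter is optional):

# Overall cluster health (green / yellow / red)
curl -s 'http://localhost:9200/_cluster/health?pretty'

# One line per index, filtered to red indices only
curl -s 'http://localhost:9200/_cat/indices?v&health=red'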

Logging

Access to logs is detailed in the logging documentation.

Fixing issues with the cluster

GOV.UK has an Enterprise-level support plan with AWS for staging and production. Since we are using a managed service, AWS should be the first point of contact for fixing issues with the Elasticsearch cluster. They can be contacted by telephone, live chat or support request.

Response times are:

  • General guidance: 24 hours
  • System impaired: 12 hours
  • Production system impaired: 4 hours
  • Production system down: 1 hour
  • Business-critical system down: 15 minutes

All requests are created through the AWS Console. Be sure to assume the correct role first, for the environment in question.
