This page describes what to do in case of an Icinga alert. For more information, you can search the govuk-puppet repo for the source of the alert.
Last updated: 10 Nov 2020

RabbitMQ: Dead nodes in cluster

This can happen if one of the machines in the cluster is killed by AWS and replaced with a new machine. In this scenario the cluster is still working, but leaving the dead node in the cluster will cause problems later, for example when we reboot the machines to install updates. The check can be found here.

First, you should check which node is dead:

fab production_aws class:rabbitmq rabbitmq.status

If a node is dead, there will be more "nodes" than "running_nodes". Once you know the identity of the dead node, you should verify whether the machine still exists:

gds govuk connect -e production ssh aws/jumpbox "govuk_node_list -C rabbitmq"
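The comparison of "nodes" against "running_nodes" described above can be sketched as a small shell helper. This is only an illustration: dead_nodes is a hypothetical function, not part of any GOV.UK tooling, and it assumes you have already copied the two lists out of the status output as space-separated node names. The node name rabbit@ip-10-13-4-7 is made up for the example.

```shell
# Hypothetical helper: print every node that appears in the first list
# ("nodes") but not in the second list ("running_nodes").
# Both arguments are space-separated node names copied from the status output.
dead_nodes() {
  all_nodes="$1"
  running_nodes="$2"
  for node in $all_nodes; do
    case " $running_nodes " in
      *" $node "*) ;;               # node is running, nothing to report
      *) printf '%s\n' "$node" ;;   # node is listed but not running
    esac
  done
}

# Example with made-up node names:
dead_nodes "rabbit@ip-10-13-5-19 rabbit@ip-10-13-4-7" "rabbit@ip-10-13-4-7"
# prints rabbit@ip-10-13-5-19
```

If the helper prints nothing, every listed node is running and there is no dead node to remove.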

If the dead node is not in the list returned by govuk_node_list, the machine no longer exists and the node can be safely removed from the cluster. To do this, SSH onto one of the RabbitMQ machines and run the following, substituting the name of the dead node:

sudo rabbitmqctl forget_cluster_node [dead_node e.g. rabbit@ip-10-13-5-19]

For information about how we use RabbitMQ, see here.