This page describes what to do in case of an Icinga alert. For more information, search the govuk-puppet repo for the source of the alert.
Last updated: 27 Jul 2023

AmazonMQ: No consumers listening to queue

Check that there is at least one non-idle consumer for rabbitmq queue {queue_name}

Icinga connects to AmazonMQ’s RabbitMQ admin API to check consumer activity and to confirm that at least one consumer is running for a given RabbitMQ message queue. See here for the plugin that implements the alert.

For information about how we use RabbitMQ, see [here][rabbitmq_doc].

No consumers listening to queue

This check reports a critical error when no consumers are listening to the queue, meaning messages entering the queue will never be processed.

No activity for X seconds: idle times are [X, Y, Z]

This checks whether the RabbitMQ consumers for a queue have been active in the last 5 minutes. Consumers in an idle state are listening to the queue but are unable to process the messages on it.
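The two alert conditions above can be sketched as a single decision function. This is a minimal illustration, not the Icinga plugin's actual code: the function name, the status labels and the input shape (one idle time in seconds per consumer) are assumptions. Only the 5 minute window and the two message formats come from this page.

```python
from typing import List, Tuple

# 5 minutes, as described above. The constant name is hypothetical.
IDLE_THRESHOLD_SECONDS = 300


def check_consumers(
    idle_seconds: List[float],
    threshold: float = IDLE_THRESHOLD_SECONDS,
) -> Tuple[str, str]:
    """Classify one queue from its consumers' idle times.

    idle_seconds has one entry per consumer: the number of seconds
    since that consumer was last active. The status labels are
    illustrative, not Icinga states.
    """
    if not idle_seconds:
        # Nothing is listening, so messages will never be processed.
        return "NO_CONSUMERS", "No consumers listening to queue"

    # The check needs at least one non-idle consumer, so only the
    # least idle consumer matters.
    if min(idle_seconds) > threshold:
        return (
            "ALL_IDLE",
            f"No activity for {threshold:g} seconds: "
            f"idle times are {idle_seconds}",
        )

    return "OK", f"{len(idle_seconds)} consumer(s) attached, at least one active"
```

For example, `check_consumers([12, 900])` is treated as healthy because one consumer was active 12 seconds ago, whereas `check_consumers([400, 900])` is reported as idle.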

Publishing API sends a heartbeat message every minute to the following queues: email-alert-service and content_data_api. This is configured via each queue’s bindings (e.g. email-alert-service’s binding), which match the routing key used in the heartbeat. This should ensure that the consumers for these queues are never idle.
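To illustrate how a binding can match the heartbeat's routing key, here is a minimal sketch of RabbitMQ topic-style matching, where `*` stands for exactly one dot-separated word and `#` for zero or more words. It assumes the exchange is a topic exchange, which this page does not state, and the binding and routing keys in the usage note are hypothetical rather than the real configuration.

```python
def topic_matches(binding_key: str, routing_key: str) -> bool:
    """Return True if routing_key matches binding_key, topic-exchange style.

    '*' matches exactly one word and '#' matches zero or more words.
    Words are separated by dots.
    """
    pattern = binding_key.split(".")
    words = routing_key.split(".")

    def match(p: int, w: int) -> bool:
        if p == len(pattern):
            # Pattern is used up: match only if the key is used up too.
            return w == len(words)
        if pattern[p] == "#":
            # '#' may absorb any number of the remaining words, including none.
            return any(match(p + 1, k) for k in range(w, len(words) + 1))
        if w == len(words):
            # Pattern still needs a word but the key has none left.
            return False
        if pattern[p] == "*" or pattern[p] == words[w]:
            return match(p + 1, w + 1)
        return False

    return match(0, 0)
```

With a hypothetical binding key of `*.major.#`, a heartbeat published with the hypothetical routing key `heartbeat.major` would be delivered to the queue, while `heartbeat.minor` would not. If a queue's binding does not match the heartbeat key, its consumers can legitimately sit idle between content changes.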

Note

You may also see the high unprocessed messages alert, because consumers that are failing to process messages can cause a large backlog to build up.

Troubleshooting

Check the AmazonMQ Grafana dashboard

This Grafana dashboard shows activity across multiple exchanges and queues. The main exchange to monitor is published_documents, which handles broadcasting to services such as search and email-alert-service when content changes across GOV.UK.

In the queue graphs, look out for the following:

  • High ‘ready’ messages for the queue - indicates that messages are waiting in RabbitMQ to be processed by the consumer (e.g. email-alert-service).

  • High ‘unacknowledged’ messages for the queue - implies that messages have been delivered to the consumer, but the consumer has not sent an ACK back to the RabbitMQ broker to say that it has finished processing them.

  • High ‘redeliver’ rate for the queue - messages can be redelivered after a network failure or a node failure, for example if the consumer dies (its channel is closed, its connection is closed, or the TCP connection is lost).

If we’re seeing high ‘redeliver’ rates, or high ‘ready’ or ‘unacknowledged’ message counts, this could indicate an issue with the consumer.
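The same three signals can be checked programmatically. The sketch below assumes a queue object shaped like the one returned by the RabbitMQ management HTTP API (`messages_ready`, `messages_unacknowledged` and `message_stats.redeliver_details.rate`). The thresholds and the function name are purely illustrative assumptions, since this page does not define what counts as "high".

```python
from typing import Any, Dict, List

# Hypothetical thresholds: tune these to the normal traffic of each queue.
READY_LIMIT = 1000
UNACKED_LIMIT = 100
REDELIVER_RATE_LIMIT = 1.0  # messages per second


def triage_queue(stats: Dict[str, Any]) -> List[str]:
    """Return a list of findings that point at a possible consumer problem."""
    findings: List[str] = []

    ready = stats.get("messages_ready", 0)
    if ready > READY_LIMIT:
        findings.append(f"high ready ({ready}): messages waiting for a consumer")

    unacked = stats.get("messages_unacknowledged", 0)
    if unacked > UNACKED_LIMIT:
        findings.append(f"high unacknowledged ({unacked}): delivered but never ACKed")

    # message_stats can be missing for a queue that has seen no traffic.
    rate = (
        stats.get("message_stats", {})
        .get("redeliver_details", {})
        .get("rate", 0.0)
    )
    if rate > REDELIVER_RATE_LIMIT:
        findings.append(f"high redeliver rate ({rate}/s): consumers may be dying")

    return findings
```

An empty list means none of the three signals is above its threshold. Any finding is a prompt to move on to the troubleshooting steps below rather than a diagnosis in itself.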

Troubleshooting steps

  1. Try restarting the application on all the machines of the relevant class. For example, to restart the email-alert-service application, you’d SSH into each email_alert_api machine and restart the app. After restarting, check whether the problem is solved.

  2. If the issue is not resolved, check the consumer’s application logs to see whether any errors are being thrown.