# AmazonMQ: No consumers listening to queue
Check that there is at least one non-idle consumer for rabbitmq queue {queue_name}
Icinga connects to AmazonMQ’s RabbitMQ admin API to check the activity of the consumers and confirm that at least one consumer is running for a given RabbitMQ message queue. See here for the plugin that implements the alert.
For information about how we use RabbitMQ, see [here][rabbitmq_doc].
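The check can be reproduced by hand against the RabbitMQ management API, which exposes a per-queue `consumers` count. The sketch below is illustrative only (it is not the actual plugin); the broker URL, credentials and queue name are placeholders:

```python
import requests

# Placeholders: the real broker host, credentials and vhost will differ.
BROKER = "https://rabbitmq.example.internal:15671"
VHOST = "%2F"  # "/" URL-encoded
QUEUE = "email-alert-service"

resp = requests.get(
    f"{BROKER}/api/queues/{VHOST}/{QUEUE}",
    auth=("monitoring", "secret"),
    timeout=10,
)
resp.raise_for_status()
queue = resp.json()

# The alert is critical when nothing is consuming from the queue at all.
if queue.get("consumers", 0) == 0:
    print(f"CRITICAL: no consumers listening to {QUEUE}")
else:
    print(f"OK: {queue['consumers']} consumer(s) on {QUEUE}")
```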
## No consumers listening to queue
This check reports a critical error when no consumers are listening to the queue, meaning messages entering the queue will never be processed.
## No activity for X seconds: idle times are [X, Y, Z]
This checks whether the RabbitMQ consumers for a queue have been active in the last 5 minutes. Consumers in an idle state are listening to the queue but are unable to process the messages on it.
Publishing API sends a heartbeat message every minute to the following queues: `email-alert-service` and `content_data_api`. This is configured via the queues’ bindings (e.g. email-alert-service’s binding) matching the routing key used in the heartbeat. This should ensure that the consumers are never idle for these queues.
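As an illustration of how the binding and routing key fit together, the sketch below uses Python and pika rather than the Ruby that Publishing API actually uses; the exchange name comes from the dashboard section below, while the routing key and message body are assumptions:

```python
import pika

EXCHANGE = "published_documents"   # the main exchange we monitor (see below)
QUEUE = "email-alert-service"
HEARTBEAT_KEY = "heartbeat.major"  # assumption: the routing key used by the heartbeat

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Assumes a topic exchange; the queue's binding must match the heartbeat
# routing key (directly or via a pattern) for heartbeats to reach the consumer.
channel.exchange_declare(exchange=EXCHANGE, exchange_type="topic", durable=True)
channel.queue_declare(queue=QUEUE, durable=True)
channel.queue_bind(queue=QUEUE, exchange=EXCHANGE, routing_key=HEARTBEAT_KEY)

# Roughly what Publishing API does every minute: publish a small message with
# the heartbeat routing key so the consumer always has something to process.
channel.basic_publish(
    exchange=EXCHANGE,
    routing_key=HEARTBEAT_KEY,
    body=b'{"heartbeat": true}',  # illustrative payload, not the real schema
)
connection.close()
```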
### Note

You may also see the high unprocessed messages alert, as issues with consumers processing messages can lead to a large backlog of messages.
## Troubleshooting
### Check the AmazonMQ Grafana dashboard
This Grafana dashboard shows activity across multiple exchanges and queues. The main exchange we expect to be monitoring is `published_documents`, which handles broadcasting to services such as search and email-alert-service when content changes across GOV.UK.
Looking at the queue graphs, we should look out for the following:

- Check for high ‘ready’ messages for the queue - this indicates messages are waiting in RabbitMQ to be processed by the consumer (e.g. email-alert-service).
- Check for high ‘unacknowledged’ messages for the queue - this implies that messages have been read by the consumer, but the consumer has never sent an ACK back to the RabbitMQ broker to say it has finished processing them.
- Check for a high ‘redeliver’ rate for the queue - in the event of a network or node failure, messages can be redelivered. An example is if the consumer dies (its channel is closed, its connection is closed, or the TCP connection is lost).

If we’re seeing a high ‘redeliver’ rate, or high ‘ready’ or ‘unacknowledged’ message counts, this could indicate an issue with the consumer.
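If the dashboard is unavailable, the same figures can be read from the RabbitMQ management API. A minimal sketch, again with placeholder host, credentials and queue name:

```python
import requests

BROKER = "https://rabbitmq.example.internal:15671"  # placeholder
VHOST = "%2F"
QUEUE = "email-alert-service"

queue = requests.get(
    f"{BROKER}/api/queues/{VHOST}/{QUEUE}",
    auth=("monitoring", "secret"),
    timeout=10,
).json()

# message_stats may be missing if the queue has seen no recent activity.
stats = queue.get("message_stats", {})
print("ready:          ", queue.get("messages_ready", 0))
print("unacknowledged: ", queue.get("messages_unacknowledged", 0))
print("redeliver rate: ", stats.get("redeliver_details", {}).get("rate", 0.0), "msg/s")
```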
### Troubleshooting steps
You could try restarting the application on all the machines of the relevant class. For example, to restart the email-alert-service application, you’d SSH into each `email_alert_api` machine and restart the app. After restarting, check to see if the problem is solved. If the issue has not been resolved, check the consumer’s application logs to see whether any errors are being thrown.