# App healthcheck not ok
Many apps on GOV.UK have a healthcheck endpoint.

The alert works by making a request to the healthcheck endpoint on the machine where the app is running. If you need to test the healthcheck endpoint manually, you can SSH on to the machine and curl it yourself.
```sh
# SSH on to a machine running Content Publisher
gds govuk connect ssh -e integration backend

# Find the port it's running on
ps -ef | grep content-publisher | grep master

# Do the Icinga check manually
curl localhost:3221/healthcheck
```
Apps with a custom healthcheck endpoint often make use of the generic checks in `govuk_app_config`. Some apps also implement custom checks, and the alert should link to documentation that explains these.
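
A healthy endpoint typically returns JSON with an overall status plus a per-check breakdown, which tells you which individual check is failing. A minimal sketch for inspecting it, assuming the app is on port 3221 as in the example above, that `jq` is available on the machine, and that the response has a top-level `checks` object:

```sh
# Pretty-print the whole healthcheck response
curl -s localhost:3221/healthcheck | jq .

# Show just the per-check statuses
# (assumes the response has a top-level "checks" object)
curl -s localhost:3221/healthcheck | jq '.checks'
```
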
## Connection Refused Error

This means the app is not accepting requests to the healthcheck endpoint, and is probably down.

- Check the processes are running (see the sketch after this list).
- Try the healthcheck endpoint manually, as above.
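
A minimal sketch for checking the processes, using Content Publisher as the example app (the systemd unit name is an assumption and may differ per app):

```sh
# SSH on to a machine running the app
gds govuk connect ssh -e integration backend

# Look for the app's processes (no output means nothing is running)
ps -ef | grep content-publisher

# If the app is managed as a systemd unit, check its status
# (the unit name here is an assumption)
sudo systemctl status content-publisher
```
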
## Timeout Error
This means the app is accepting requests, but taking too long to respond (over 20 seconds).
- Try the healthcheck endpoint manually, as above, to confirm (a timing sketch follows this list).
- Check the logs, e.g. `tail -100f /var/log/email-alert-api/app.err.log`.
- Check for resource issues, e.g. on the Machine dashboard.
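
To see how close the response is to the 20 second threshold, you can time the request yourself. A minimal sketch, assuming the app is on port 3221 as in the example above:

```sh
# Print the total time the healthcheck request took, in seconds
curl -s -o /dev/null -w '%{time_total}\n' localhost:3221/healthcheck

# Or give up if the request takes longer than 25 seconds
curl --max-time 25 localhost:3221/healthcheck
```
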
## Active Record Check
This means the app is unable to connect to its database.
- Check for any RDS alerts.
- Check the AWS RDS dashboard to see if we're experiencing resourcing issues.
- Check for network connectivity to the DB.
```sh
# SSH on to a machine with the problem
gds govuk connect ssh -e integration backend

# Find the DB connection details
govuk_setenv content-publisher env | grep -i postgres

# Try to connect
psql -h postgresql-primary -U content-publisher -W content_publisher_production

# Try a command
select * from users;
```
## Sidekiq / Redis Check
This means that the Sidekiq workers can't connect to Redis.
- Check for any Redis alerts.
- Check for network connectivity to Redis.
```sh
# SSH on to a machine with the problem
gds govuk connect ssh -e integration backend

# Find the redis connection details
govuk_setenv content-publisher env | grep -i redis

# Try to connect
redis-cli -h backend-redis

# Try a command
keys *
```
## Sidekiq Retry Size Check
This means that Sidekiq jobs are failing.
- Check the Sidekiq 'Retry set size' graph to see if we have a high number of failed jobs (see the sketch after this list).
- Are the workers reporting any problems, or are any issues being raised in Sentry?
- Check Kibana for Sidekiq error logs (`application: <app> AND @type: sidekiq`).
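
You can also look at the retry set directly in Redis. Sidekiq keeps jobs that are due for retry in a sorted set called `retry`; this sketch assumes the default key name with no Redis namespace prefix, which may not hold for every app:

```sh
# SSH on to a machine with the problem
gds govuk connect ssh -e integration backend

# Count the jobs in the Sidekiq retry set
# (assumes the default "retry" key with no namespace prefix)
redis-cli -h backend-redis zcard retry

# Peek at a few retries to see which jobs are failing
redis-cli -h backend-redis zrange retry 0 4
```
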
## Sidekiq Queue Latency Check
This means the time it takes for a Sidekiq job to be processed is unusually high.
- Check the Sidekiq 'Queue Length' graph to see if we have a high number of jobs queued up (see the sketch after this list).
- Check the Machine dashboard or the AWS RDS postgres dashboard to see if we're experiencing resourcing issues.
- Are the workers reporting any problems, or are any issues being raised in Sentry?
- Check Kibana for Sidekiq error logs (`application: <app> AND @type: sidekiq`).
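
Sidekiq stores each queue as a Redis list named `queue:<name>` and the set of queue names under `queues`, so you can check queue lengths directly. A minimal sketch, again assuming the default key layout with no Redis namespace prefix:

```sh
# SSH on to a machine with the problem
gds govuk connect ssh -e integration backend

# List the Sidekiq queues
# (assumes the default "queues" key with no namespace prefix)
redis-cli -h backend-redis smembers queues

# Check how many jobs are waiting on one of them
redis-cli -h backend-redis llen queue:default
```
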