We have a few monitoring checks for Redis:
- memory usage
- number of connected clients
- list length (for the
Redis is configured to use 75% of the system memory. If Redis memory usage reaches this limit it will stop accepting data input. Redis is also unable to write its RDB data dumps when it’s out of memory, so data loss is possible.
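As a rough illustration of the memory check, the sketch below parses the used_memory field from the text printed by redis-cli INFO memory and compares it with a limit of 75% of system memory. The function names and the sample figures are hypothetical and are not taken from our monitoring configuration.

```python
def parse_info(info_text):
    """Parse `redis-cli INFO` output ("key:value" lines) into a dict."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_usage_fraction(info_text, system_memory_bytes, limit_fraction=0.75):
    """Return used_memory as a fraction of the assumed Redis memory limit."""
    used = int(parse_info(info_text)["used_memory"])
    limit = system_memory_bytes * limit_fraction
    return used / limit


# Hypothetical sample: 3 GiB used on a machine with 8 GiB of memory.
sample = "# Memory\r\nused_memory:3221225472\r\nused_memory_human:3.00G\r\n"
fraction = memory_usage_fraction(sample, 8 * 1024**3)
print(f"{fraction:.0%} of the memory limit in use")
```

A value close to 1.0 means Redis is about to stop accepting data.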
Since version 2.6, Redis dynamically sets the maximum allowed number of clients to 32 less than the number of file descriptors that can be opened at a time.
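To make the arithmetic concrete, here is a minimal sketch that applies the "file descriptors minus 32" rule to an open-file limit. The helper name is hypothetical, the limit read here is that of the current process rather than the Redis process, and the sketch ignores any maxclients value set in the Redis configuration.

```python
import resource


def max_clients_from_fd_limit(fd_limit, reserved=32):
    """Apply the rule above: open-file limit minus 32 reserved descriptors."""
    return max(fd_limit - reserved, 0)


# Soft open-file limit of the *current* process (Unix only); the Redis
# process may run with a different limit.
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
if soft_limit == resource.RLIM_INFINITY:
    print("open-file limit is unlimited")
else:
    print(max_clients_from_fd_limit(soft_limit))
```

For example, an open-file limit of 1024 would leave room for 992 clients.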
General Redis investigation tips
If Redis is using large amounts of memory you could use
to find out what those things might be.
Redis for logs
Investigation of the problem
For logs-redis, memory may be exhausted if the Elasticsearch river is not consuming data from Redis for some reason. In this case, Redis will accumulate data up to its memory limit and then stop accepting log data. When this happens, log data is lost from the logging pipeline because logship discards data.
- You should use graphite for looking at CPU load and used memory. (Pending: More graphite inputs)
- Also check how this impacts Elasticsearch.
- The redis-cli CLIENT LIST command shows the clients that are currently connected. Other redis-cli commands may also come in handy.
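When there are many connections, the output of redis-cli CLIENT LIST is easier to read once summarised. The sketch below parses that output (one client per line, space-separated key=value pairs) and counts connections per source host. The function names, addresses and trimmed field list are made up for illustration.

```python
from collections import Counter


def parse_client_list(output):
    """Parse `redis-cli CLIENT LIST` output into one dict per client."""
    clients = []
    for line in output.splitlines():
        if not line.strip():
            continue
        fields = {}
        for pair in line.split():
            # Values may be empty, e.g. "name=".
            key, _, value = pair.partition("=")
            fields[key] = value
        clients.append(fields)
    return clients


def clients_per_host(clients):
    """Count connections by source host (the part of addr before the port)."""
    return Counter(c["addr"].rsplit(":", 1)[0] for c in clients)


# Hypothetical sample output, trimmed to a few fields.
sample = (
    "id=3 addr=10.0.0.5:52555 fd=8 name= age=855 idle=0 cmd=rpush\n"
    "id=4 addr=10.0.0.5:52556 fd=9 name= age=12 idle=3 cmd=rpush\n"
    "id=7 addr=10.0.0.9:40112 fd=10 name= age=4 idle=0 cmd=blpop\n"
)
for host, count in clients_per_host(parse_client_list(sample)).most_common():
    print(host, count)
```

A host holding an unexpectedly large share of the connections is a reasonable place to start looking.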
Redis rivers for Elasticsearch
We use elasticsearch-redis-river (a Redis river for Elasticsearch). This is a special process which continuously reads data from a Redis queue (called logs) into Elasticsearch.
There is a Nagios alert (listed above) for when the logs Redis list gets too long. This might be because Elasticsearch is unavailable, or for some other reason.
Generally this can be fixed by deleting and recreating the rivers. This is safe to do because the river pulls data from Redis (rather than Redis pushing data into Elasticsearch).
Delete and recreate the rivers with this Fabric command:
fab $environment -H logs-elasticsearch-1.management elasticsearch.delete:_river puppet
The elasticsearch.delete:_river task deletes all rivers, and the puppet task runs Puppet, which will recreate them.
To manually check the length of the list, use:
fab $environment -H logs-redis-1.management do:'redis-cli LLEN logs'
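For a quick manual judgement of the number that LLEN returns, something like the following sketch can help. The thresholds are placeholders, not the values used by the real alert, and the function name is hypothetical. It accepts both output forms of redis-cli: the bare number printed when output is piped, and the "(integer) N" form printed in an interactive terminal.

```python
def classify_list_length(llen_output, warning=10000, critical=50000):
    """Map the output of `redis-cli LLEN logs` to an alert state."""
    # Interactive redis-cli prefixes integers with "(integer)".
    length = int(llen_output.replace("(integer)", "").strip())
    if length >= critical:
        return "CRITICAL"
    if length >= warning:
        return "WARNING"
    return "OK"


print(classify_list_length("(integer) 42"))
print(classify_list_length("60000\n"))
```

If the list keeps growing between two checks, the river is probably not consuming data.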