Prolonged GC collection times
This check fires when the Elasticsearch JVM garbage collection times (in milliseconds) exceed the warning or critical thresholds. The values are collected into Graphite via collectd from the Elasticsearch API.
Currently the check uses a Graphite function to summarise the metric over a 5-minute period and takes the maximum value in that period.
You can find the current value using curl if you SSH into the affected box:
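The exact request depends on the Elasticsearch version running on the box. As a minimal sketch, assuming the node listens on localhost:9200 (as in the delete commands further down) and exposes the node stats API at `_nodes/_local/stats/jvm`, something like the following should print the JVM statistics. The `ES_URL` variable and the `es_jvm_stats` function name are illustrative only.

```shell
# Assumption: Elasticsearch listens on localhost:9200 on the affected box.
# Assumption: the node stats path below; older releases may expose it elsewhere.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the JVM section of the node stats for the local node, pretty-printed
# so that each key ends up on its own line.
es_jvm_stats() {
  curl -s "${ES_URL}/_nodes/_local/stats/jvm?pretty"
}
```

Run `es_jvm_stats | less` to page through the output.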
You need to look for collection_time_in_millis. There will be two
values, one for the old collector and one for young. Both are checked
by Nagios and correspond to different portions of the JVM heap. The
lower these times are, the better. Another important value is
heap_used_percent, which should also be low. If it gets too high it
may prevent garbage collection.
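To avoid scanning the whole response by eye, the fields of interest can be filtered out. This is a rough sketch, assuming the stats were requested with `?pretty` so that each key sits on its own line, and that the collectors are named `young` and `old`. The `gc_summary` function name is hypothetical.

```shell
# Read pretty-printed node stats JSON on stdin and print only the values
# this check cares about. Assumes one JSON key per line (?pretty output).
gc_summary() {
  awk '
    # Remember the most recent "young" or "old" section name.
    /"(young|old)"[ ]*:/          { gsub(/[" \t:{]/, ""); collector = $0; next }
    # Print the GC time, labelled with the collector it belongs to.
    /"collection_time_in_millis"/ { gsub(/[^0-9]/, ""); print collector " collection_time_in_millis=" $0; next }
    # Print the overall heap usage percentage.
    /"heap_used_percent"/         { gsub(/[^0-9]/, ""); print "heap_used_percent=" $0 }
  '
}
```

Pipe the JSON from your curl request into `gc_summary`. It prints one line per collector plus the heap usage, which makes it easier to compare against the Nagios thresholds.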
Solving this problem largely depends on what the particular box is being used for.
On boxes where the data in Elasticsearch isn’t critical (e.g. for
ci-agent, where it is only test data) freeing up heap
space can be achieved by deleting indexes:
curl -XDELETE localhost:9200/<index name>
to delete a specific index, or:
curl -XDELETE localhost:9200/_all
to delete all indexes. Obviously run these with extreme care!
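Because a mistyped index name can be costly, it may help to list the indexes first and wrap the delete in a small guard. The sketch below assumes the `_cat/indices` API is available on the installed Elasticsearch version. The `ES_URL` variable and the `list_indexes` and `delete_index` function names are hypothetical helpers, not part of any existing tooling.

```shell
# Assumption: Elasticsearch listens on localhost:9200 on this box.
ES_URL="${ES_URL:-http://localhost:9200}"

# List index names and sizes so you can pick one to delete.
# Assumption: the _cat API exists on this Elasticsearch version.
list_indexes() {
  curl -s "${ES_URL}/_cat/indices?v"
}

# Delete exactly one named index. Refuses empty names, _all,
# wildcards and comma-separated lists to avoid accidental mass deletion.
delete_index() {
  case "$1" in
    ''|_all|*'*'*|*,*)
      echo "refusing to delete '$1'" >&2
      return 1
      ;;
  esac
  curl -s -XDELETE "${ES_URL}/$1"
}
```

For example, run `list_indexes` and then `delete_index <index name>`. The guard deliberately rejects `_all`, so deleting everything still requires typing the raw curl command on purpose.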
Places to investigate?
- If the alert comes from the Elasticsearch search boxes, make sure it is not affecting the gov.uk site search.
- If the alert comes from the Elasticsearch logging boxes, make sure it is not leading to loss of log lines.
If you are still struggling, these links might help, but they are very in-depth.