Many of our applications use Sidekiq for background job processing.
There’s a GOV.UK wrapper that will help you set it up.
Sidekiq has built-in retry logic (turned on by default, but configurable). We graph Sidekiq job stats: successes, failures, job timings and retry counts. These can be found under the statsd Graphite namespace, for example:
stats.gauges.govuk.app.support.workers.retry_set_size
stats.govuk.app.*.workers.*.failure
stats.govuk.app.<app_name>.workers.<worker_name>.failure
stats.govuk.app.whitehall.workers.SearchIndexAddWorker.failure
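The per-worker failure paths above share one pattern, with the application and worker names as the variable parts. As a minimal sketch, the helper below builds such a path; `worker_failure_metric` is a hypothetical name used only for illustration and is not part of any GOV.UK library. Leaving an argument out falls back to the `*` wildcard, as in the second example path.

```ruby
# Hypothetical helper (not part of any GOV.UK gem): builds a Graphite
# metric path for worker failures, following the pattern shown above.
# Omitted arguments default to the "*" wildcard.
def worker_failure_metric(app_name: "*", worker_name: "*")
  "stats.govuk.app.#{app_name}.workers.#{worker_name}.failure"
end

puts worker_failure_metric
puts worker_failure_metric(app_name: "whitehall", worker_name: "SearchIndexAddWorker")
```

The second call reproduces the Whitehall example path listed above.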
Rummager worker data use a different namespace and can be accessed at:
Jobs do fail. This is not inherently bad and can happen for a number of reasons. When a job fails, it gets retried with an exponential backoff (up to 21 days), as long as retries are enabled. A high number of retries suggests that a bigger, less transient problem may be occurring.
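To get a feel for where the figure of roughly 21 days comes from, the sketch below reproduces the default backoff schedule as described in Sidekiq's error handling documentation: a delay of `(retry_count ** 4) + 15` seconds plus a small random jitter, with 25 retries by default. The method and constant names are illustrative assumptions, and the exact jitter term may differ between Sidekiq versions, so treat this as an approximation rather than the library's actual code.

```ruby
# Assumed default retry limit, per Sidekiq's documentation.
DEFAULT_MAX_RETRIES = 25

# Deterministic part of the delay (in seconds) before a given retry.
# retry_count is 0 for the first retry.
def base_delay(retry_count)
  (retry_count**4) + 15
end

# Delay including random jitter. The jitter term is an assumption based
# on the documented formula and may vary by Sidekiq version.
def retry_delay(retry_count)
  base_delay(retry_count) + (rand(10) * (retry_count + 1))
end

# Total time spent waiting across all retries, ignoring jitter.
def total_base_seconds(max_retries = DEFAULT_MAX_RETRIES)
  (0...max_retries).sum { |n| base_delay(n) }
end

p (0..4).map { |n| base_delay(n) } # first five delays, without jitter
puts format("%.1f days", total_base_seconds / 86_400.0)
```

Without jitter, the first delays are 15, 16, 31, 96 and 271 seconds, and the 25 delays together add up to a little over 20 days. Jitter adds at most a few thousand seconds on top, which is consistent with the "up to 21 days" figure above.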
If an alert fires, a good place to start investigating is the Sidekiq monitor.