Many of our applications use Sidekiq for background job processing.
There’s a GOV.UK wrapper that will help you set it up.
If an alert fires, a good place to start investigating is the Sidekiq monitor.
Sidekiq has built-in retry logic (turned on by default, but configurable). We graph Sidekiq job stats: successes, failures, job timings and retry counts. These can be found under the statsd Graphite namespace, for example:
- stats.gauges.govuk.app.support.workers.retry_set_size
- stats.govuk.app.*.workers.*.failure
- stats.govuk.app.<app_name>.workers.<worker_name>.failure
- stats.govuk.app.whitehall.workers.SearchIndexAddWorker.failure
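As a rough illustration of how the failure metric paths above are put together, the following sketch builds a metric name and a Graphite render API URL for it. The host `graphite.example.com` and the helper names `failure_metric` and `graphite_render_url` are hypothetical and used only for illustration; substitute the Graphite address for your environment.

```ruby
require "uri"

# Hypothetical host, for illustration only.
GRAPHITE_BASE = "https://graphite.example.com"

# Builds a failure metric path following the pattern
# stats.govuk.app.<app_name>.workers.<worker_name>.failure
# A "*" wildcard matches every app or worker.
def failure_metric(app_name, worker_name = "*")
  "stats.govuk.app.#{app_name}.workers.#{worker_name}.failure"
end

# Builds a Graphite render API URL that returns the series as JSON.
def graphite_render_url(target, from: "-1h")
  query = URI.encode_www_form(target: target, from: from, format: "json")
  "#{GRAPHITE_BASE}/render?#{query}"
end

puts graphite_render_url(failure_metric("whitehall", "SearchIndexAddWorker"))
```

Passing `"*"` as the app name gives the wildcard form of the path, which may be useful for spotting failures across all apps at once.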
Rummager worker data use a different namespace and can be accessed at:
Jobs do fail. This is not inherently bad and can happen for a number of reasons. When a job fails it gets retried with an exponential backoff (up to 21 days), as long as retries are enabled. A high number of retries suggests that a bigger, less transient problem may be occurring.
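To see where the 21 day figure comes from, here is a minimal sketch of the backoff arithmetic. It assumes Sidekiq's default of 25 retries and a delay of roughly `retry_count ** 4 + 15` seconds per attempt, as described in the Sidekiq documentation. The real delay also includes a random jitter term that varies between Sidekiq versions, so it is left out here. The method names are hypothetical.

```ruby
# Approximate delay in seconds before retry number `count` (0-based).
# Assumption: jitter is ignored, so real delays are slightly longer.
def retry_delay_seconds(count)
  (count**4) + 15
end

# Total time spent waiting across all retries before Sidekiq gives up.
# Assumption: 25 retries, which is Sidekiq's default.
def total_retry_seconds(max_retries = 25)
  (0...max_retries).sum { |count| retry_delay_seconds(count) }
end

total = total_retry_seconds
puts format("%.1f days", total / 86_400.0)
```

Without jitter the total comes to a little over 20 days, which is consistent with the "up to 21 days" figure once jitter is added.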