Many of our applications use Sidekiq for background job processing.
There’s a GOV.UK wrapper that will help you set it up.
If an alert fires, a good place to start investigating is the Sidekiq monitor.
Sidekiq has built-in retry logic, which is turned on by default but is configurable. Middleware sends metrics (successes, failures, job timings and retry counts) to statsd, which forwards the data to Graphite for storage. The Monitor Sidekiq workers page explains how to view these metrics.
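To illustrate how a middleware can report these metrics, here is a minimal sketch. It is not the GOV.UK wrapper's actual implementation. `FakeStatsd`, `JobMetricsMiddleware`, `ExampleJob` and the metric names are hypothetical. The sketch assumes the Sidekiq server middleware convention of a `call` method that receives the job instance, the job payload and the queue name and yields to run the job, and it assumes a statsd client that offers `increment` and `timing` methods.

```ruby
# Hypothetical in-memory stand-in for a statsd client, so the sketch runs
# without any gems. A real client would send the data to statsd instead.
class FakeStatsd
  attr_reader :counters, :timings

  def initialize
    @counters = Hash.new(0)
    @timings = Hash.new { |hash, key| hash[key] = [] }
  end

  def increment(stat)
    @counters[stat] += 1
  end

  def timing(stat, milliseconds)
    @timings[stat] << milliseconds
  end
end

# Hypothetical middleware in the style of a Sidekiq server middleware.
# The block passed to `call` runs the job itself.
class JobMetricsMiddleware
  def initialize(statsd)
    @statsd = statsd
  end

  def call(job_instance, job_payload, _queue)
    prefix = "workers.#{job_instance.class.name}" # hypothetical metric naming
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    # Assumption: the payload carries "retry_count" once a job has failed before.
    @statsd.increment("#{prefix}.retry") if job_payload.key?("retry_count")
    yield
    @statsd.increment("#{prefix}.success")
  rescue StandardError
    @statsd.increment("#{prefix}.failure")
    raise # re-raise so the retry logic still sees the failure
  ensure
    if started
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      @statsd.timing("#{prefix}.processing_time", (elapsed * 1000).round)
    end
  end
end

# Hypothetical job class, used only to demonstrate the middleware.
class ExampleJob; end

statsd = FakeStatsd.new
middleware = JobMetricsMiddleware.new(statsd)
middleware.call(ExampleJob.new, { "class" => "ExampleJob" }, "default") { :done }
puts statsd.counters.inspect
```

Re-raising the exception after counting the failure matters: if the middleware swallowed it, the job would look successful and would never be retried.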
Jobs do fail. This is not inherently bad and can happen for a number of reasons. When a job fails, it is retried with an exponential backoff (for up to 21 days), as long as retries are enabled. A high number of retries suggests that a bigger, less transient problem may be occurring.
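The "up to 21 days" figure can be checked with a rough calculation. The sketch below assumes Sidekiq's documented defaults of 25 retries and a delay of about `(retry_count ** 4) + 15` seconds before each retry. It leaves out the small random jitter that Sidekiq adds, and the exact formula may differ between Sidekiq versions. The method and constant names are hypothetical.

```ruby
# Assumed Sidekiq default: a job is retried up to 25 times.
DEFAULT_MAX_RETRIES = 25
SECONDS_PER_DAY = 86_400.0

# Delay in seconds before retry number `count` (0-based), without jitter.
# Assumed formula; check the Sidekiq version in use.
def base_retry_delay(count)
  (count**4) + 15
end

# Delays for every retry, in the order they would happen.
def retry_schedule(max_retries = DEFAULT_MAX_RETRIES)
  (0...max_retries).map { |count| base_retry_delay(count) }
end

schedule = retry_schedule
puts "First retry after #{schedule.first} seconds"
puts "Last retry after #{(schedule.last / 3600.0).round(1)} hours"
puts "All retries span #{(schedule.sum / SECONDS_PER_DAY).round(1)} days"
```

Under these assumptions the delays add up to roughly 20.4 days before jitter, which is consistent with the 21 day limit mentioned above.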