email-alert-api: 9. Sidekiq Lost Job Recovery
We use Sidekiq to do background processing in Email Alert API. This includes critical tasks, such as generating and sending email. However, it's possible for these jobs to be lost:
If the code to enqueue the job in Redis fails, either for transient reasons or due to a bug.
If a worker process quits ungracefully while processing a job.
The second issue has occurred frequently due to workers being forcibly restarted when they consume too much memory. A memorable example was the loss of a Travel Advice content change, which was detected by the alert for travel advice emails and led to an incident. By contrast, we don't think the first issue occurs frequently.
To help support engineers cope with certain kinds of lost work, we had manual steps to recover it. These existed for the jobs that process content changes and messages. The manual steps involved checking the state of the database as an indication of whether jobs might have been lost. However, it's possible checking the database will turn up false positives e.g. if the system is running slowly. Both workers would double check to prevent accidentally duplicating work.
Ideally these issues wouldn't exist in the first place. Previously we considered upgrading to Sidekiq Pro, which persists jobs in Redis while they are being processed. Using Sidekiq Pro would therefore prevent the second issue, but not the first, which is unrelated to Sidekiq. It's also worth noting that we previously had concerns about switching to Sidekiq Pro, thinking it would be hard to use in practice. Still, it's possible the benefits could outweigh the drawbacks.
We decided to implement a new worker to automatically recover all work associated with generating and sending email. This resolves both of the above issues with lost work, and codifies the previously manual recovery steps. Since there is little urgency around sending email, we decided to run the worker only infrequently (currently every half hour), for work that's over an hour old - we expect most work to have been processed within an hour.
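As a rough sketch of that selection logic (the names and record shapes here are hypothetical stand-ins, not the actual Email Alert API code), the recovery worker amounts to filtering unfinished work older than a cutoff:

```ruby
require "time"

# Hypothetical sketch of the recovery worker's selection step. The real
# worker queries the database; plain hashes stand in for records here.
RECOVERY_AGE_SECONDS = 60 * 60 # only consider work over an hour old

def recoverable_work(items, now: Time.now)
  cutoff = now - RECOVERY_AGE_SECONDS
  items.reject { |item| item[:completed] }
       .select { |item| item[:created_at] < cutoff }
end
```

The worker would run on its half-hourly schedule and re-enqueue a job for each item returned.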
An edge case for recovery is the initiation of the daily and weekly digest runs. If one of these scheduled jobs is lost before it can create a DigestRun record, our normal strategy of checking the database won't work. For this scenario, we decided to have a separate recovery strategy that's coupled to the schedule for the initiator jobs. The strategy involves looking back over the previous week to see if any DigestRun records are missing for that period.
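A sketch of that look-back check (method and argument names are invented for illustration; the real code works against DigestRun records) might look like:

```ruby
require "date"

# Hypothetical sketch: find dates in the previous week that have no daily
# DigestRun record, so a lost initiator job can be detected and re-run.
def missing_daily_digest_dates(dates_with_digest_runs, today: Date.today)
  ((today - 7)...today).reject { |date| dates_with_digest_runs.include?(date) }
end
```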
We decided to pursue recovering lost work instead of acquiring Sidekiq Pro, which would help avoid the loss in the first place. We had the impression that the process to pay for, acquire and integrate Sidekiq Pro into Email Alert API would take longer to complete than implementing our own solution.
Potential for duplicate work
As mentioned in the context, it's possible we may recover work that still exists in the system. For example, a job that's over an hour old may simply be delayed due to an unusually high backlog in its Sidekiq queue. Although we could check the state of Sidekiq as part of finding lost work, it's still possible to have race conditions where we falsely requeue work that's not lost. To cope with this, we modified each worker so that it's idempotent.
Each worker will double check if it has previously completed. However, this could still be a false positive if duplicate jobs end up running concurrently. We had two approaches to cope with this:
For fast, high volume jobs, like sending an email, we used a blocking, row-level lock inside a transaction. Using a row-level lock is faster because the lock is part of the SELECT ... FOR UPDATE statement we already execute to fetch the Email record. This is better for the common case, where there is no duplicate work. Even if duplicate work exists, it will only occupy a worker for a short time, until the original job completes and the lock is released.
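As an in-memory analogy of this pattern (a Mutex standing in for the database row lock, with invented names), the shape is: take the lock, re-check whether the work was already done, and only then do it:

```ruby
# In-memory analogy of the row-lock pattern: a Mutex plays the part of the
# database's SELECT ... FOR UPDATE, and a `sent` flag plays the Email state.
class EmailRecord
  attr_accessor :sent

  def initialize
    @sent = false
    @lock = Mutex.new
  end

  # Idempotent send: a duplicate job blocks on the lock, then sees the
  # original job's completed work and becomes a no-op.
  def deliver
    @lock.synchronize do
      return :already_sent if sent

      self.sent = true # stand-in for actually sending the email
      :sent
    end
  end
end
```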
While making jobs idempotent means the system will behave correctly in the long term, in the short term it's still possible for the recovery to generate an alarming amount of "no-op" work on a queue e.g. if the system is running slowly. This is a particular concern for jobs to send email, where the queue latency can exceed one hour, with many thousands of fresh jobs in the queue. We used SidekiqUniqueJobs to prevent a snowball effect in this scenario.
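SidekiqUniqueJobs is configured per worker. A minimal fragment (assuming the gem's `lock: :until_executed` strategy, with a hypothetical worker name) looks something like:

```ruby
class DeliveryWorker
  include Sidekiq::Worker

  # Drop duplicate jobs for the same arguments until the first completes,
  # preventing recovery from snowballing no-op work onto a slow queue.
  sidekiq_options lock: :until_executed
end
```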
Competing with Sidekiq retries
In creating the recovery worker we realised it would be in competition with the retries we have set up for jobs (or that Sidekiq applies by default). Sidekiq retries are faster than recovery, so it makes sense to continue using them.
However, the recovery worker could still generate a relatively large backlog of retrying work, if a job is perpetually failing due to a bug. We have tried to mitigate this by limiting the number of retries for each job, but we could also consider using the unique jobs approach for emails if this isn't enough. In the long term, we also plan to improve the alerts for the system, so that we can intervene in good time if work appears to be perpetually failing.
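Limiting retries is a standard per-worker Sidekiq option. A minimal fragment (the worker name and retry count are illustrative, not the values we settled on) looks like:

```ruby
class ProcessContentChangeWorker
  include Sidekiq::Worker

  # Cap retries so a perpetually failing job can't build a large backlog;
  # after the cap is reached, Sidekiq moves the job to the dead set.
  sidekiq_options retry: 5
end
```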
Delay in processing lost work
In the worst case, it could take up to 1.5 hours to recover a piece of lost work. This is something we could tune if necessary e.g. for transactional emails, which should be sent within 15 minutes. We plan to consider this as part of our work on improving the alerts for the system. We also plan to reconsider using Sidekiq Pro as a more lightweight route to speedy recovery, with the recovery worker as a fallback.