Fix stuck virus scanning
Documents uploaded to asset-master are scanned asynchronously through a virus scanner, as explained below.
When many documents are uploaded at once, users can wait a long time before their documents become available. A Grafana dashboard shows how many documents have been processed and how long each file waits between being written to disk and being scanned.
Virus scanning process
All files uploaded through Whitehall (including both attachment documents and
images) are asynchronously run through a virus scanner (ClamAV) before being
made available to view. This process runs on the asset master.
The AV scan process is as follows:
- Publisher uploads file, which gets written to a sub-directory within
/mnt/uploads/whitehall/incoming/, with the full path based on the type of the upload and the ID of the edition being edited.
- Every minute, a cron job runs as the assets user and triggers the process-uploaded-attachments.sh script.
- Each file currently found within the incoming directory is scanned using ClamAV via the virus-scan-file.sh script.
- If the file is clean, it is moved to the /mnt/uploads/whitehall/clean/ directory, and is then available for users to view. The script also writes the file name to a list, which is put into a temporary directory. If the file is found to have an infection, it is moved to the
- Independently of this process, another cron job runs every minute and triggers the copy-attachments-to-slaves script. This script copies each file found in the list (produced by the previous task) to each asset slave and an Amazon S3 bucket.
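The per-file pass described above can be sketched as a small shell function. This is an illustration under stated assumptions, not the contents of process-uploaded-attachments.sh or virus-scan-file.sh: the function name scan_incoming, the SCANNER override and all four arguments (including the destination for infected files) are hypothetical. It relies on clamscan's exit codes: 0 for a clean file, 1 for a detected virus, and anything else for a scanner error.

```shell
# Hypothetical sketch of one scan pass; not the real scripts.
# Usage: scan_incoming INCOMING CLEAN INFECTED COPY_LIST
# Pass directories without a trailing slash. File names containing
# newlines are not handled.
scan_incoming() {
    incoming=$1; clean=$2; infected=$3; copy_list=$4
    find "$incoming" -type f | while IFS= read -r file; do
        rel=${file#"$incoming"/}    # path relative to the incoming directory
        status=0
        "${SCANNER:-clamscan}" --no-summary "$file" </dev/null >/dev/null 2>&1 || status=$?
        case $status in
            0) dest=$clean/$rel ;;       # clean
            1) dest=$infected/$rel ;;    # virus found
            *) continue ;;               # scanner error: retry on the next pass
        esac
        mkdir -p "$(dirname "$dest")"
        mv "$file" "$dest"
        if [ "$status" -eq 0 ]; then
            # Record clean files so a separate job can copy them onwards.
            printf '%s\n' "$rel" >> "$copy_list"
        fi
    done
}
```

Because the scanner is started once per file, the start-up cost is paid for every file, which is why a long queue drains slowly.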
All scripts write to syslog, so you can check on the current processing as follows:
$ tail -f /var/log/syslog | grep "process_uploaded_attachment\|virus_scan\|copy_attachment"
Detecting new viruses
A separate script regularly rescans all the previously uploaded files in both clean and draft-clean, to catch infections that are only detected by newly released virus signatures.
The script is configured in cron to run every hour, but actually takes over 2 days to complete. It starts under a lock so that only one scan runs at a time.
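The source does not say how the lock is implemented. As a minimal sketch of the idea, assuming a lock directory created with mkdir (which is atomic, so only one caller can succeed), the wrapper below shows why an hourly cron entry does not pile up overlapping runs of a job that takes days. The function name run_exclusively and the lock path are hypothetical.

```shell
# Hypothetical sketch: run a command only if no other run holds the lock.
# Usage: run_exclusively LOCK_DIR COMMAND [ARGS...]
run_exclusively() {
    lock_dir=$1; shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous run still in progress; skipping" >&2
        return 1
    fi
    status=0
    "$@" || status=$?
    rmdir "$lock_dir"    # release the lock once the command has finished
    return "$status"
}
```

A run that is killed before reaching rmdir leaves a stale lock behind, so a real script would also need some form of clean-up.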
Quickly process a backlog of files awaiting AV scan
The AV scan process is currently quite slow (usually taking between 10 and 12 seconds per file), and we get a lot of files to check. If there’s a large backlog, publishers can wait hours for their files to be scanned. You can scan everything in the backlog in one go as follows:
$ find /mnt/uploads/whitehall/incoming -type f -print0 | xargs -0 clamscan
This shouldn’t take more than a minute, unless there’s a huge (>500 files) backlog. This is much quicker than the per-file scan because clamscan starts only once, rather than once per file.
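To judge whether a backlog is worth shortcutting, you can estimate how long the per-file queue would take to drain. The sketch below is a rough calculation only: it multiplies the number of queued files by an assumed 11 seconds per file (the midpoint of the 10 to 12 seconds quoted above). The function name estimate_backlog_seconds is hypothetical.

```shell
# Hypothetical helper: rough time, in seconds, to drain the queue one file
# at a time. Usage: estimate_backlog_seconds DIR [SECONDS_PER_FILE]
estimate_backlog_seconds() {
    dir=$1
    per_file=${2:-11}    # assumed midpoint of 10-12 seconds per file
    count=$(find "$dir" -type f | wc -l)
    echo $(( $count * $per_file ))
}
```

For example, 500 queued files at 11 seconds each come to 5500 seconds, or roughly an hour and a half.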
The summary at the end of the scan output tells you how many files are infected. Assuming they are all clean (Infected files: 0), you can make all these available immediately by running:
$ sudo -u assets rsync -rav /mnt/uploads/whitehall/incoming/* /mnt/uploads/whitehall/clean/
Note that this doesn’t actually clear the queue: it simply shortcuts it. Each file will still be scanned individually, and then copied to the asset slaves appropriately. This means that any new uploads will still be at the back of the queue.
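If you would rather not check the scan summary by eye before copying files, the check can be scripted. The sketch below assumes you have saved or piped the clamscan output, and the function name scan_reports_clean is hypothetical. It looks at every "Infected files:" line because xargs may split a long file list across several clamscan runs, each of which prints its own summary.

```shell
# Hypothetical helper: reads clamscan output on stdin and succeeds only if
# at least one "Infected files:" line was seen and every one reports 0.
scan_reports_clean() {
    awk -F': *' '
        /^Infected files:/ { seen = 1; if ($2 + 0 != 0) bad = 1 }
        END { if (seen && !bad) exit 0; exit 1 }
    '
}

# Example use (paths are illustrative):
#   clamscan -r /some/dir > /tmp/scan.log
#   scan_reports_clean < /tmp/scan.log && echo "safe to copy"
```

Treating missing summaries as a failure is a deliberate choice here, so that an empty or truncated log is never mistaken for a clean scan.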
In order to clear the queue, you can copy the files to a temporary directory:
$ sudo -u assets rsync -rav /mnt/uploads/whitehall/incoming/* /mnt/uploads/whitehall/temp/
and then remove the files from /mnt/uploads/whitehall/incoming/, so that there’s a clean distinction between the files that have been copied and those that are newly uploaded. Next, scan the copied files manually:
$ find /mnt/uploads/whitehall/temp -type f -print0 | xargs -0 clamscan
Then, if they are all clean, copy them to the clean directory:
$ sudo -u assets rsync -rav /mnt/uploads/whitehall/temp/* /mnt/uploads/whitehall/clean/
Finally, remember to delete your temporary directory!