Low available disk space
See also: low available disk inodes
Low available disk space on root
You can try clearing out the APT cache:
sudo apt-get clean
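To see how much space the cache is currently taking (and so whether clearing it is likely to help), you can check the size of the standard APT cache directory first:
sudo du -sh /var/cache/apt/archives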
Low available disk space on /mnt/uploads for asset-master-1 and asset-slave-1
Nagios is set up to monitor disk usage of /mnt/uploads on asset-master-1.backend and asset-slave-1.backend. Usually, the culprit behind space-chomping is /mnt/uploads/whitehall, and there are two folders on which to focus clean-up efforts:
/mnt/uploads/whitehall/attachment-cache
/mnt/uploads/whitehall/carrierwave-tmp
Running the following command from within the above folders (i.e. cd to /mnt/uploads…) usually works best, as it means that we don’t end up with the parent folders being deleted:
find -type d -ctime +1 -exec rm -rf {} +
In the above example, we:
- find -type d - find all directories within the folder you are currently cd’d into
- -ctime +1 - limit the results of the above find to directories changed (ctime) more than a day (24hrs) ago
- -exec rm -rf {} + - execute a forced removal of the directories found, with the above conditions
The + allows the command to run more efficiently. The find man page explains the + functionality:
This variant of the -exec option runs the specified command on the selected files, but the command line is built by appending each selected file name at the end; the total number of invocations of the command will be much less than the number of matched files. The command line is built in much the same way that xargs builds its command lines. Only one instance of ‘{}’ is allowed within the command. The command is executed in the starting directory.
This command will take a while to complete, depending - of course - on how large your two starting folders are. After it has completed, push Nagios to re-check the service on each affected host.
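If you want to see what would be removed before running the command for real, a preview along these lines (a suggested safety check rather than part of the documented procedure) lists the candidate directories without deleting anything; run it from within the same folder:
find . -type d -ctime +1 -print | head -n 20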
Low available disk space on /mnt/crawler_worker on mirrorer
This can be caused when the AWS autoscaler re-spins a VM and the newly created
VM does not attach the secondary EBS volume properly. In this case, the alert
will say DISK CRITICAL: /mnt/crawler_worker not found.
Cause
An EBS volume can only be attached to one VM at a time. When the autoscaler terminates a VM, AWS disconnects the EBS volume and releases a lock, while at the same time spinning up the replacement VM. The new VM can spin up and be provisioned faster than the lock can be released.
Solution
Manually re-attach the EBS volume tagged with mirrorer to the mirrorer instance, then run Puppet on the instance. This will cause Puppet to correctly mount the EBS volume in /mnt/crawler_worker. The console steps are below; an AWS CLI alternative is sketched after them.
- Log in to the AWS Web Console:
gds aws govuk-integration-poweruser -l
- Click Services -> EC2
- Change region to “Europe (Ireland) eu-west-1”
- Click Instances on the left hand side menu bar and filter by “mirrorer”
- Copy the Instance ID; it should look similar to i-06887a467af1266a0
- In the left hand side menu, expand Elastic Block Store and click Volumes
- Filter by “mirrorer” and ensure that the state of the only result is “available”
- Right click volume and choose “Attach Volume”
- Paste the Instance ID into the appropriate box, leaving Device as the default
- Click Attach
- After AWS finishes attaching the volume to the instance the state should change to “in-use”
- Once it has settled to the “in-use” state, SSH into the mirrorer instance and run
govuk_puppet --test
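If you prefer the command line, roughly equivalent steps using the AWS CLI are sketched below. This is a hedged outline rather than the documented procedure: the tag filters, the volume ID and the device name are illustrative assumptions and should be checked against what the console shows.
# check that the mirrorer volume is in the "available" state (tag filter is an assumption)
aws ec2 describe-volumes --region eu-west-1 --filters "Name=tag:Name,Values=mirrorer"
# look up the mirrorer instance ID
aws ec2 describe-instances --region eu-west-1 --filters "Name=tag:Name,Values=*mirrorer*" --query "Reservations[].Instances[].InstanceId"
# attach the volume (the volume ID and device name here are illustrative)
aws ec2 attach-volume --region eu-west-1 --volume-id vol-0123456789abcdef0 --instance-id i-06887a467af1266a0 --device /dev/sdf
Once the volume shows as “in-use”, continue with the Puppet run described above.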
No disk space in general
If you can’t find any of the above cases causing low disk space, it can
be useful to run the command sudo ncdu <mountpoint>, where <mountpoint>
will be the path specified in the Nagios alert. It will calculate all
the disk space being used, and then bring you into an ncurses file
explorer-style interface. Press enter to go into a directory, and d to
delete any large files.
Press q to quit, or ? for help.
If a particular mountpoint is rather large, it may be helpful to first
scan it to see which directory inside it is using the most space, then
use ncdu on that directory. Do it with the following command:
sudo du -hc --max-depth=1 <mountpoint>
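A variant that sorts the output so the largest directories appear last may be easier to read (this assumes GNU coreutils, where sort understands human-readable sizes via -h):
sudo du -h --max-depth=1 <mountpoint> | sort -h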
If you are able to determine that the machine simply requires more disk space to run the applications (such as a growing database) you can increase capacity by:
- Growing a persistent volume in case of a stateful app
- If the app is stateless and the volume ephemeral, you can resize it in govuk-aws-data at data/<app>/<env>/common.tfvars by modifying root_block_device_volume (a quick way to locate the setting is sketched below). You will then have to terminate the instances for them to be respawned by the autoscaling group with the new, resized volume.
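As a hedged example of locating the setting (the exact file layout may differ between apps and environments), from a checkout of govuk-aws-data you can run:
grep root_block_device_volume data/<app>/<env>/common.tfvars
Edit the value in that file, then terminate the instances as described above so the autoscaling group recreates them with the larger volume.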
Disk space used by logfiles
You may discover that disk space is mostly being used up by logfiles. Some work has been done to logrotate to make it aware of disk space, and to delete every log file over a certain size, in reverse chronological order, until disk space is brought below the threshold. If logfiles continue to cause issues for disk space, you should look into tweaking the thresholds for the above code first.
logrotate sometimes fails to properly rotate a logfile for whatever reason, so it can just be one or two of the latest logfiles using a high proportion of the disk space.
If it is out of hours, free up room to resolve the incident without performing the maintenance tasks described below. Pass to 2ndline to deal with during working hours.
Firstly, see if you can remove any old logfiles, where <dir> is the
directory where the large file lives: find <dir> -mtime +45 -type f
(for files over 45 days old, for example).
You can remove them by running: find <dir> -mtime +45 -type f -delete
Ideally, see if you can copy the large file elsewhere (on a different
mount point which has free space) and then entirely truncate that log:
truncate -s 0 <logfile>. Manually gzip the copied file (gzip <file>)
and put it back in the log file directory.
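Putting those steps together, the sequence looks roughly like this (the file and mount point names here are hypothetical examples, not real GOV.UK paths):
cp /var/log/example/app.log /mnt/spare/app.log
truncate -s 0 /var/log/example/app.log
gzip /mnt/spare/app.log
mv /mnt/spare/app.log.gz /var/log/example/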
You can force a logrotate run afterwards:
sudo logrotate -vf /etc/logrotate.d/<type of logs>
If you have nowhere to move the logs and it is causing an incident, uptime of the server takes precedence over logfiles, so just truncate the logs to free up space.
Low available disk space on Jenkins
One possible cause of this is that the /var/lib/docker directory is consuming
a large amount of disk space. This has been found to happen on the ci-agent
machines.
Verify this:
$ cd /var/lib
$ sudo ncdu
If /var/lib/docker is consuming a large amount of disk space, run the
following as root:
$ docker rmi $(docker images --filter "dangling=true" -q --no-trunc)
$ docker volume rm $(docker volume ls -qf dangling=true)
This will remove ‘dangling’ images and volumes. A comprehensive set of instructions can be found in this Docker resource cleanup gist.
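To get a quick summary of what Docker itself reports as image, container and volume usage (available on reasonably recent Docker versions), you can also run:
$ docker system df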