Reboot a machine
Under normal circumstances, most machines reboot automatically when an update is required. Some machines need to be rebooted manually.
Icinga alerts state when machines need rebooting, and will tell you if it’s a manual reboot and whether it can be done in-hours or should be done out-of-hours.
If machines are not rebooting automatically, there may be a problem with the locking mechanism. See next section.
Checking locking status
locksmith manages unattended reboots to ensure that systems stay available. Occasionally a problem with locking means that machines can't reboot automatically.
SSH into the machine in question, and run the following command, replacing ‘integration’ with whatever environment the machine is running in:
$ /usr/bin/locksmithctl -endpoint='http://etcd.integration.govuk-internal.digital:2379' status
If a lock is in place, the output will detail which machine holds the lock. To check which machine created the lock, you can search for it in AWS. If the machine doesn't exist, it may have been terminated before it could release the lock. The locking machine is almost always a docker_management machine, so rebooting the docker_management machine will cause it to refresh and clear out-of-date locks. If the docker_management machine needs rebooting anyway, you should try this first.
If you need to manually remove the lock, you can remove it with:
$ /usr/bin/locksmithctl -endpoint='http://etcd.integration.govuk-internal.digital:2379' unlock '<machine-name>'
Machines that are safe to reboot should then do so at the scheduled time.
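Putting the two commands together, the lock check and manual unlock for a given environment look like this (a sketch: `<machine-name>` comes from the `status` output, and the endpoint hostname follows the pattern shown above):

```shell
# Pick the environment the machine runs in (integration, staging or production).
ENVIRONMENT=integration
ETCD_ENDPOINT="http://etcd.${ENVIRONMENT}.govuk-internal.digital:2379"

# Show the current reboot-lock status, including which machine holds a lock.
/usr/bin/locksmithctl -endpoint="$ETCD_ENDPOINT" status

# If the holding machine has been terminated, release its stale lock.
/usr/bin/locksmithctl -endpoint="$ETCD_ENDPOINT" unlock '<machine-name>'
```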
You can manually reboot virtual machines. You should follow these general rules:
- Do not reboot more than one machine of the same class at the same time.
- Before you reboot, check whether the machine is safe to reboot by looking at the machine hieradata in govuk-puppet.
- If a machine is not safe to reboot, its YAML file will have a `govuk_safe_to_reboot::can_reboot` property with a value of "no", "careful" or "overnight" (example).
- Be sure to check all of the YAML files for your machine class, as the `govuk_safe_to_reboot` configuration may differ between environments.
- If there is no `govuk_safe_to_reboot` configuration, the machine is considered safe to reboot.
- Even if a machine isn't considered 'safe' to reboot, you may need to do so in the event of an incident. Just be mindful of the downstream effects of the reboot.
- Check if there are special instructions below for the machine type you're rebooting. If there aren't, skip to the "Rebooting other machines" instructions.
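As an illustration of the hieradata check, here is a hypothetical YAML fragment using the `govuk_safe_to_reboot::can_reboot` property described above. The file path is made up for the example; in practice you would grep the hieradata directories of your govuk-puppet checkout:

```shell
# Create a stand-in for a machine class hieradata file (illustrative only;
# real files live in the govuk-puppet hieradata directories).
cat > /tmp/example_machine_class.yaml <<'EOF'
govuk_safe_to_reboot::can_reboot: 'careful'
EOF

# Search for the safety property, as you would across govuk-puppet hieradata.
grep 'govuk_safe_to_reboot' /tmp/example_machine_class.yaml
```

If the grep finds no `govuk_safe_to_reboot` configuration for your machine class, the machine is considered safe to reboot.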
Note that if a reboot gets stuck or takes too long, it can result in AWS automatically terminating that machine. If this happens, AWS should automatically create a new machine to replace the old one.
Unless there are urgent updates to apply, the primary machine should not be rebooted in production during working hours, as the primary machine is required for attachments to be uploaded.
The secondary machines can be rebooted as they hold a copy of the data and are resynced regularly.
Reboots of the `step_down_primary` machine in the production environment should be organised by On Call staff.
You may reboot the primary machine in the staging environment during working hours; however, it is prudent to warn colleagues that uploading attachments will be unavailable during this period.
Cache machines

cache machines run the `router` app, which handles live user traffic. To safely reboot them without serving too many errors to users, we must remove them from the AWS load balancer target groups before rebooting.
The tool for rebooting cache machines is in the govuk-puppet repository. This is the recommended way to reboot a cache machine:
cd govuk-puppet
gds aws govuk-integration-poweruser -- ./tools/reboot-cache-instances.sh -e integration ip-1-2-3-4.eu-west-1.compute.internal
The tool takes an environment and a private DNS name, which are provided by the Icinga alert.
You can also follow this process manually:
- Log in to the AWS Console for the relevant environment (`gds aws govuk-<ENVIRONMENT>-<your-role> -l`).
- Find the Instance ID of the critical machine(s) (probably all 8).
- Navigate to the `blue-cache` Auto Scaling Group (ASG).
- Put the instance into a "Standby" state, which will detach the instance from its Target Groups.
- Once the instance is in a Standby state, SSH onto the machine (find this from the reboots required alert listing).
- Check that traffic has reduced to only the Smokey healthchecks: `tail -f /var/log/nginx/lb-access.log`.
- Schedule downtime in Icinga and run `sudo reboot` on the remote machine as normal.
- In the `blue-cache` ASG, move the instance back to the "InService" state, which will re-add the instance to its Target Groups.
- Check that traffic is flowing from the load balancer with `tail -f /var/log/nginx/lb-access.log` again.
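The same standby/in-service flow can also be driven from the AWS CLI rather than the console. This is a sketch, not the documented procedure: the instance ID and ASG name below are placeholders, and you need appropriate AWS credentials (for example via `gds aws`):

```shell
INSTANCE_ID=i-0123456789abcdef0   # placeholder: the cache instance to reboot
ASG_NAME=blue-cache               # placeholder: the Auto Scaling Group name

# Detach the instance from its target groups by moving it to Standby.
aws autoscaling enter-standby \
  --instance-ids "$INSTANCE_ID" \
  --auto-scaling-group-name "$ASG_NAME" \
  --should-decrement-desired-capacity

# ... SSH in, check traffic has drained, schedule downtime, sudo reboot ...

# After the reboot, move the instance back to InService.
aws autoscaling exit-standby \
  --instance-ids "$INSTANCE_ID" \
  --auto-scaling-group-name "$ASG_NAME"
```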
CI agents

Sometimes a CI agent starts continually erroring on Jenkins jobs, and the most straightforward way to fix the issue is to reboot the machine.
First, visit the Jenkins nodes list. Click on the problematic agent and then “Mark this node temporarily offline”. You’ll need to provide a reason, which could just be “to reboot problematic agent”. This will stop the agent from being used for new jobs.
SSH into the agent; the machine number to SSH into will match the agent number. For example:
gds govuk connect ssh -e integration ci_agent:6
Reboot the machine:
sudo reboot
Finally, go back into the Jenkins nodes list to bring the node back online and then "Launch agent". You'll be taken to the live log for the agent, where you should see the output `Agent successfully connected and online`.
docker_management machines

It is only safe to reboot a docker_management machine while no other unattended reboot is underway, because this machine manages the locks for unattended reboots of other machines. If it is down, multiple machines in high availability groups may choose to reboot themselves at the same time.
To avoid this happening, we need to disable unattended reboots on all the other machines in the environment while we reboot this one:
- Set the unattended reboots configuration option to `false` in the govuk-puppet common configuration - you can do this in a branch.
- Build the branch of govuk-puppet to Production.
- Wait half an hour to allow all machines to pull from the Puppetmaster.
- Reboot the docker_management machine.
- Deploy the previous release of govuk-puppet to Production.
Jenkins

We only have one instance in each environment that runs Jenkins, so rebooting Jenkins will cause downtime for developers. It will prevent developers from being able to deploy code, apply Terraform, run Smokey, and lots of other things.
Before rebooting Jenkins, put a message in #govuk-2ndline-tech. Check that there are no (important) jobs in progress. When things look free, schedule downtime in Icinga, SSH into the machine, and run `sudo reboot`.
Avoid doing this late in the day, as Jenkins is quite brittle and a reboot may cause runtime issues that require SRE assistance. Worse, it may trigger a new instance (see manual rebooting), which can sometimes fail to provision correctly due to the original instance retaining a ‘lock’ on the volume, which then can’t be mounted on the new instance. In this case, terminating the new instance should fix things, as that will cause another new instance to be created, and this time there will be no lock on the volume.
MongoDB machines

Read “find the primary” to figure out which Mongo machine is the primary and which are the secondaries.
All secondary Mongo machines reboot overnight automatically. If you need to reboot them sooner, reboot them one at a time, checking that the Mongo cluster is healthy before moving on to the next machine.
To reboot the primary, you’ll need to step the current primary down so that it becomes a secondary machine. You’re then free to reboot it as above.
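Stepping down the primary can be done from the mongo shell on the current primary. A minimal sketch, assuming the standard `mongo` client and a replica set (these are generic MongoDB commands, not GOV.UK-specific tooling):

```shell
# Check replica set health: every member should be PRIMARY, SECONDARY or ARBITER.
mongo --quiet --eval 'rs.status().members.forEach(function (m) { print(m.name, m.stateStr); })'

# Ask the current primary to step down so an election promotes a secondary.
mongo --quiet --eval 'rs.stepDown()'
```

Once the machine reports itself as SECONDARY, it can be rebooted like any other secondary.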
Monitoring machines

Before rebooting, post a message in the #govuk-2ndline-tech channel in case anyone is looking at the alerts. SSH into the machine and run `sudo reboot`. The Icinga alerts will be temporarily unavailable.
Router backend machines run MongoDB and can be rebooted as per the MongoDB rebooting guidance above.
Rebooting other machines
This guidance applies if you want to reboot a machine that is not one of the previous types.
Schedule downtime in Icinga, or let GOV.UK Technical 2nd Line know that there will be alerts for a machine being down.
SSH into the machine and run:
sudo reboot