Last updated: 22 Sep 2022

Reboot a machine

Under normal circumstances, most machines reboot automatically when an update is required. Some machines need to be rebooted manually.

Icinga alerts state when machines need rebooting, and will tell you if it’s a manual reboot and whether it can be done in-hours or should be done out-of-hours.

Automatic rebooting

If machines are not rebooting automatically, there may be a problem with the locking mechanism. See next section.

Checking locking status

locksmith manages unattended reboots with a locking mechanism to ensure that systems remain available. If something goes wrong with a lock, machines can’t reboot automatically.

SSH into the machine in question, and run the following command, replacing ‘integration’ with whatever environment the machine is running in:

$ /usr/bin/locksmithctl -endpoint='http://etcd.integration.govuk-internal.digital:2379' status

If a lock is in place, it will detail which machine holds the lock.

You can remove it with:

$ /usr/bin/locksmithctl -endpoint='http://etcd.integration.govuk-internal.digital:2379' unlock '<machine-name>'

Machines that are safe to reboot should then do so at the scheduled time.
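The two locksmithctl commands above differ only by environment name, so they can be wrapped in small helpers. This is an illustrative sketch; the function names are not part of locksmithctl itself:

```shell
# Hypothetical helpers around the locksmithctl commands above.
# Builds the etcd endpoint URL for a given environment name.
locksmith_endpoint() {
  echo "http://etcd.${1}.govuk-internal.digital:2379"
}

# Show lock status, e.g. `locksmith_status integration`.
locksmith_status() {
  /usr/bin/locksmithctl -endpoint="$(locksmith_endpoint "$1")" status
}

# Release a stuck lock, e.g. `locksmith_unlock integration <machine-name>`.
locksmith_unlock() {
  /usr/bin/locksmithctl -endpoint="$(locksmith_endpoint "$1")" unlock "$2"
}
```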

Manual rebooting

You can manually reboot virtual machines. You should follow these general rules:

  1. Do not reboot more than one machine of the same class at the same time.
  2. Before you reboot, check whether the machine is safe to reboot, by looking at the machine hieradata in govuk-puppet.
    • If a machine is not safe to reboot, its YAML file will have a govuk_safe_to_reboot::can_reboot property with a value of “no”, “careful” or “overnight”.
    • Be sure to check all of the YAML files for your machine class, as the govuk_safe_to_reboot configuration may differ between environments.
    • If there is no govuk_safe_to_reboot configuration, the machine is considered safe to reboot.
    • Even if a machine isn’t considered ‘safe’ to reboot, you may need to do so in the event of an incident. Just be mindful of the downstream effects of the reboot.
  3. Check if there are special instructions below for the machine type you’re rebooting. If there aren’t, then skip to the “rebooting other machines” instructions.

Note that if a reboot gets stuck or takes too long, it can result in AWS automatically terminating that machine. If this happens, AWS should automatically create a new machine to replace the old one.
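The safe-to-reboot check in step 2 can be sketched as a grep over a local govuk-puppet checkout. The hieradata directory layout assumed here is illustrative; adjust the paths to match the repository:

```shell
# Best-effort check of whether a machine class is safe to reboot,
# given a local govuk-puppet checkout. Searches every matching YAML
# file across hieradata directories, since the setting may differ
# between environments. The hieradata*/ layout is an assumption.
check_safe_to_reboot() {
  local repo="$1" machine_class="$2"
  # No govuk_safe_to_reboot entry means the machine is considered safe.
  grep -s 'govuk_safe_to_reboot::can_reboot' \
    "$repo"/hieradata*/"$machine_class"*.yaml \
    || echo "govuk_safe_to_reboot::can_reboot: 'yes' (no entry found)"
}
```

For example, `check_safe_to_reboot ~/govuk/govuk-puppet mongo` would print any `can_reboot` values declared for mongo machines.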

Rebooting asset_master machines

Unless there are urgent updates to apply, the primary machine should not be rebooted in production during working hours, as the primary machine is required for attachments to be uploaded.

The secondary machines can be rebooted as they hold a copy of data and are resynced regularly.

In the production environment, reboots of the step_down_primary machine should be organised by on-call staff.

You may reboot the primary machine in the staging environment during working hours; however, it is prudent to warn colleagues that uploading attachments will be unavailable during this period.

Rebooting cache machines

The cache machines run the router app which handles live user traffic. To safely reboot them without serving too many errors to users, we must remove them from the AWS load balancer target groups before rebooting.

The tool for rebooting cache machines is in the govuk-puppet repository. This is the recommended way to reboot a cache machine:

cd govuk-puppet
gds aws govuk-integration-poweruser -- ./tools/reboot-cache-instances.sh -e integration ip-1-2-3-4.eu-west-1.compute.internal

The tool takes an environment and a private DNS name, which are provided by the Icinga alert.

You can also follow this process manually:

  1. Login to the AWS Console for the relevant environment (gds aws govuk-<ENVIRONMENT>-<your-role> -l).
  2. Find the Instance ID of the relevant machine(s) (probably all 8 blue-cache machines).
  3. Navigate to the blue-cache Auto Scaling Group (ASG).
  4. Put the instance into a “Standby” state, which will detach the instance from its Target Groups.
  5. Once the instance is in a Standby state, SSH onto the machine (find this from the reboots required alert listing).
  6. Check that traffic has reduced to only the Smokey healthchecks: tail -f /var/log/nginx/lb-access.log.
  7. Schedule downtime in Icinga and run sudo reboot on the remote machine as normal.
  8. In the blue-cache ASG, move the instance back to the “InService” state, which will re-attach the instance to its Target Groups.
  9. Check that traffic is flowing from the load balancer with tail -f /var/log/nginx/lb-access.log again.
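The standby steps above can also be sketched with the AWS CLI. The blue-cache ASG name is taken from the steps above; the AWS_CLI override is purely illustrative, and in practice the reboot-cache-instances.sh wrapper is preferred:

```shell
# Illustrative AWS CLI equivalents of the Standby/InService steps.
# Uses ${AWS_CLI:-aws} so the command can be stubbed for testing.

# Detach the instance from its Target Groups before rebooting.
# Decrementing desired capacity stops the ASG launching a replacement.
enter_standby() {
  "${AWS_CLI:-aws}" autoscaling enter-standby \
    --auto-scaling-group-name blue-cache \
    --instance-ids "$1" \
    --should-decrement-desired-capacity
}

# Re-attach the instance once it is back up and healthy.
exit_standby() {
  "${AWS_CLI:-aws}" autoscaling exit-standby \
    --auto-scaling-group-name blue-cache \
    --instance-ids "$1"
}
```

You would run these with credentials for the relevant environment, for example via `gds aws govuk-integration-poweruser -- bash`.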

Rebooting ci_agent machines

Sometimes a CI agent starts continually erroring on Jenkins jobs, and the most straightforward way of fixing the issue is to reboot the machine.

First, visit the Jenkins nodes list. Click on the problematic agent and then “Mark this node temporarily offline”. You’ll need to provide a reason, which could just be “to reboot problematic agent”. This will stop the agent from being used for new jobs.

SSH into the agent; the machine number to SSH into will match the agent number. For example:

gds govuk connect ssh -e integration ci_agent:6

Reboot the machine: sudo reboot.

Finally, go back into the Jenkins nodes list to take the node online and then to “Launch agent”. You’ll be taken to the live log for the agent, where you should see the output Agent successfully connected and online.

Rebooting docker_management machines

It is only safe to reboot a docker_management machine while no other unattended reboot is underway. This is because it manages the locks for unattended reboots of the other machines. If it is down, multiple machines in high availability groups may choose to reboot themselves at the same time.

To avoid this happening, we need to disable unattended reboots on all the other machines in the environment while we reboot this one:

  1. Set govuk_unattended_reboot::enabled to false in the govuk-puppet common configuration - you can do this in a branch.
  2. Deploy the branch of govuk-puppet to production.
  3. Wait half an hour to allow all machines to pull the change from the puppetmaster.
  4. Reboot the docker_management machine (sudo reboot).
  5. Deploy the previous release of govuk-puppet to production.
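Step 1 amounts to a one-line change to the common hieradata. A minimal sketch, assuming the setting lives in a file such as hieradata/common.yaml (the exact filename is an assumption about the repository layout):

```shell
# Hypothetical helper: flip govuk_unattended_reboot::enabled to false
# in a hieradata file, keeping a .bak copy of the original.
disable_unattended_reboots() {
  local file="$1"
  sed -i.bak \
    's/^govuk_unattended_reboot::enabled:.*/govuk_unattended_reboot::enabled: false/' \
    "$file"
}
```

Commit the change on a branch, deploy it, and revert by deploying the previous release as in step 5.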

Rebooting jenkins machines

We only have one instance running Jenkins in each environment, so rebooting Jenkins will cause downtime for developers: it will prevent them from deploying code, applying Terraform, running Smokey, and lots of other things.

Before rebooting Jenkins, put a message in #govuk-2ndline-tech and consider doing the same in #govuk-developers. Check that there are no (important) jobs in progress. When things look free, sudo reboot.

Avoid doing this late in the day, as Jenkins is quite brittle and a reboot may cause runtime issues that require SRE assistance. Worse, a stuck reboot may trigger a replacement instance (see manual rebooting), which can sometimes fail to provision correctly because the original instance retains a ‘lock’ on the volume, which then can’t be mounted on the new instance. In this case, terminating the new instance should fix things: another new instance will be created, and this time there will be no lock on the volume.

Rebooting mongo machines

Read “find the primary” to figure out which Mongo machine is the primary and which are the secondaries.

All secondary Mongo machines reboot overnight automatically. If you need to reboot them sooner, reboot them one at a time, ensuring that you check that the Mongo cluster is healthy before moving onto the next machine.

To reboot the primary, you’ll need to step the current primary down so that it becomes a secondary machine. You’re then free to reboot it as above.
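Stepping the primary down uses MongoDB’s standard rs.stepDown() method. A minimal wrapper, where the MONGO override is purely illustrative:

```shell
# Sketch of stepping down the current Mongo primary before a reboot.
# Run this while connected to the primary. rs.stepDown(120) asks the
# primary to step down and not seek re-election for 120 seconds,
# giving the replica set time to elect a new primary.
step_down_primary() {
  "${MONGO:-mongo}" --quiet --eval 'rs.stepDown(120)'
}
```

Once a new primary has been elected (check with rs.status()), the old primary is a secondary and can be rebooted as above.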

Rebooting monitoring machines

Before rebooting, post a message in the #govuk-2ndline-tech channel in case anyone is looking at the alerts.

SSH into the machine and run sudo reboot. The Icinga alerts will be temporarily unavailable. While the machine is coming back up, a number of Smokey loop errors will be triggered; these will resolve themselves within half an hour.

Rebooting rabbitmq machines

There are 3 RabbitMQ virtual machines in a cluster. Reboot one machine at a time, and only reboot RabbitMQ machines in-hours.

  1. SSH into the machine and environment you want to reboot by running the following command:

    gds govuk connect ssh -e <ENVIRONMENT> <MACHINE>
    

    For example, to SSH into the integration environment of the rabbitmq:1 machine:

    gds govuk connect ssh -e integration rabbitmq:1
    
  2. Check that the RabbitMQ cluster is healthy by running sudo rabbitmqctl cluster_status.

    This prints a list of expected nodes and a list of currently running nodes. If the two lists are the same, then the cluster is healthy. The following output is an example of a healthy cluster:

    Cluster status of node 'rabbit@ip-10-12-6-130'
    [{nodes,[{disc,['rabbit@ip-10-12-4-186','rabbit@ip-10-12-5-128',
                'rabbit@ip-10-12-6-130']}]},
     {running_nodes,['rabbit@ip-10-12-4-186','rabbit@ip-10-12-5-128',
                 'rabbit@ip-10-12-6-130']},
     {cluster_name,<<"rabbit@ip-10-12-6-130.eu-west-1.compute.internal">>},
     {partitions,[]},
     {alarms,[{'rabbit@ip-10-12-4-186',[]},
              {'rabbit@ip-10-12-5-128',[]},
              {'rabbit@ip-10-12-6-130',[]}]}]
    
  3. Reboot the machine by running sudo reboot.

When you have rebooted the machine, you should monitor alerts to see if there are any RabbitMQ-related alerts. You might also wish to monitor the cluster via the RabbitMQ web control panel dashboard. This dashboard shows the current members of the cluster and means that you can avoid polling sudo rabbitmqctl cluster_status to determine when your restarted machine has rejoined the cluster.
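The healthy/unhealthy comparison in step 2 can be automated with a best-effort parse of the cluster_status output. This is a sketch, assuming the Erlang-term format shown in the example above:

```shell
# Reads `rabbitmqctl cluster_status` output on stdin and succeeds only
# when the configured node list matches running_nodes. The parsing is
# best-effort against the Erlang term format shown above.
cluster_is_healthy() {
  local status expected running
  status=$(cat)
  # Node names configured in the cluster (the {nodes,[{disc,[...]}]} list).
  expected=$(printf '%s\n' "$status" | sed -n '/{nodes,/,/]}]/p' \
    | grep -o "rabbit@[^']*" | sort)
  # Nodes currently running (the {running_nodes,[...]} list).
  running=$(printf '%s\n' "$status" | sed -n '/{running_nodes,/,/]}/p' \
    | grep -o "rabbit@[^']*" | sort)
  [ -n "$expected" ] && [ "$expected" = "$running" ]
}
```

For example, `sudo rabbitmqctl cluster_status | cluster_is_healthy && echo healthy` could be run on the machine after a reboot to confirm the node has rejoined.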

For more information on RabbitMQ-related alerts, see the GOV.UK Puppet RabbitMQ monitoring.pp file.

There have been two incidents after rebooting RabbitMQ machines. For more information, see the No non-idle RabbitMQ consumers and Publishing API jobs became stuck incident reports.

Rebooting router_backend machines

Router backend machines are instances of MongoDB machines and can be rebooted as per the MongoDB rebooting guidance.

Rebooting other machines

This guidance applies if you want to reboot a machine that is not one of the previous types.

  1. Schedule downtime in Icinga, or let GOV.UK Technical 2nd Line know that there will be alerts for a machine being down.

  2. SSH into the machine and run:

   sudo reboot