2nd line drills

Last updated: 22 Jul 2024

Several areas are important to drill on 2nd line, including some tasks you may not encounter in your mission team. We want developers to have the opportunity to practise these tasks ahead of the real thing, and in preparation for going on call if they are part of the out-of-hours rota.

Follow the Deploy an emergency banner instructions on Staging.

You’ll need to choose a non-serious and clearly fake news headline. For example:

CAMPAIGN_CLASS: Death of a notable person
HEADING: Henry Fielding dies
SHORT_DESCRIPTION: English novelist and dramatist known for his earthy humour and satire dies, age 47
LINK: https://en.wikipedia.org/wiki/Henry_Fielding
LINK_TEXT: More information
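
Before running the deploy, it can help to sanity-check the values you plan to enter. The following is a minimal sketch and not part of any GOV.UK tooling: the `check_banner_fields` helper is hypothetical, and its rules (all five fields present and non-empty, link starting with `https://`) are assumptions made for illustration. Only the field names come from the example above; the deployment instructions remain the authority on which fields are optional.

```python
REQUIRED_FIELDS = ("CAMPAIGN_CLASS", "HEADING", "SHORT_DESCRIPTION", "LINK", "LINK_TEXT")


def check_banner_fields(fields: dict[str, str]) -> list[str]:
    """Return a list of problems found in the banner values (empty if none)."""
    problems = []
    for name in REQUIRED_FIELDS:
        # Assumption: every field is treated as required for the drill.
        if not fields.get(name, "").strip():
            problems.append(f"{name} is missing or empty")
    link = fields.get("LINK", "")
    if link and not link.startswith("https://"):
        problems.append("LINK should be an https:// URL")
    return problems


drill_banner = {
    "CAMPAIGN_CLASS": "Death of a notable person",
    "HEADING": "Henry Fielding dies",
    "SHORT_DESCRIPTION": "English novelist and dramatist known for his earthy humour and satire dies, age 47",
    "LINK": "https://en.wikipedia.org/wiki/Henry_Fielding",
    "LINK_TEXT": "More information",
}

print(check_banner_fields(drill_banner))  # prints []
```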

Use a restored database in an app

On Integration or Staging, follow the Restore an RDS instance via the AWS CLI instructions for an app of your choice.

Force failover to GOV.UK mirror and Emergency publishing using the GOV.UK mirror

Warn in #govuk-2ndline-tech that you’re about to do this, as it will lead to a spike in alerts and will also break continuous deployment for a while (due to Smokey failures).
Follow the Forcing failover to the GOV.UK mirrors instructions on Integration or Staging.
To verify that it worked, visit a page at random and purge it from the cache. Reload the page to see the ‘mirrored’ version of the content. NB: you wouldn’t do this in a real incident, as we’d want to serve Fastly’s cached version for as long as possible.
Undo your changes so that Nginx handles requests again.
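
The purge in the verification step can be sketched as an HTTP request using the `PURGE` method, which Fastly accepts for single-URL purges. This is a sketch under assumptions: the `build_purge_request` helper is hypothetical, the URL is a placeholder, and the environment you are drilling in may require authentication or a different purge route, so prefer the team's own purge instructions where they differ.

```python
import urllib.request


def build_purge_request(page_url: str) -> urllib.request.Request:
    """Build (but do not send) a PURGE request for a single page."""
    return urllib.request.Request(page_url, method="PURGE")


# Placeholder URL: substitute a page on the environment you are drilling in.
request = build_purge_request("https://www.example.test/some-page")
print(request.get_method(), request.full_url)

# To actually send it (not done here):
# urllib.request.urlopen(request, timeout=10)
```

Building the request separately from sending it makes it easy to check the method and URL before anything touches the CDN.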

Drill logging into accounts

Make sure you can log into the following:

The AWS web console
Fastly (using your individual account)
Statuspage (using your individual account)
Logit (using your individual account)
Heroku
data.gov.uk CKAN
RubyGems (2ndline/rubygems in Secrets Manager)
npm (2ndline/npm in Secrets Manager)

For more information about accessing shared credentials follow the Retrieve shared credentials from AWS Secrets Manager doc.
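
For the shared logins, reading a secret programmatically might look like the sketch below. It assumes a boto3 Secrets Manager client created with credentials that are allowed to read the secret, and that the secret is stored as a string. `fetch_shared_secret` is a hypothetical helper, and the linked document remains the authoritative route. Only the secret names `2ndline/rubygems` and `2ndline/npm` come from the list above.

```python
def fetch_shared_secret(client, secret_id: str) -> str:
    """Return the string value of a secret.

    `client` is expected to behave like boto3.client("secretsmanager").
    Assumption: the secret is stored as a string, not as binary.
    """
    response = client.get_secret_value(SecretId=secret_id)
    return response["SecretString"]
```

With boto3 installed and AWS credentials in place, `fetch_shared_secret(boto3.client("secretsmanager"), "2ndline/npm")` would return the stored value. Avoid printing the result in shared terminals or logs.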

Drill scaling up an application

In preparation for a large spike in traffic, you can increase the number of replicas for an app.

Pick an application and try scaling it up in staging. Don’t forget to revert your change afterwards.

Example PR - Increase content store replica count in staging

Drill 2nd line incident processes

Drill an end to end incident

Decide on a hypothetical incident scenario, e.g. “GOV.UK is down”. Walk through the incident management guidance. Use common sense when following the steps (for example, don’t actually publish an incident to Statuspage or email stakeholders).

Drill how to communicate when Slack is down

Ensure you know how to communicate with your 2nd line colleagues if Slack is unavailable. See Communicate when Slack is unavailable for details.

Drill special deployment conditions

Deploy from AWS CodeCommit when GitHub is unavailable

Warnings

Please run the following drill in integration. Some steps in the guide below refer to production, so be mindful and choose the correct environment for the drill.
Please choose a low impact deployable application that has a Dockerfile. The drill does not cater for repos that are dependencies such as gems and npm packages.
This drill requires a pause of Continuous Deployment for all applications deployed through ArgoCD for about 1 hour.

Steps

Send a Slack message to an announcement channel like #govuk-developers to schedule a time to run the drill.

An example message:

@here this week we're going to be drilling deploying from AWS CodeCommit without using GitHub on 2ndline.
Part of this drill involves disabling CD for all applications in all environments for ~1 hour.
We're planning on actioning the step of the documentation that disables auto-sync on Argo on <Wednesday at 2.30pm>.

Please let us know if this timing will cause you any issues and we can reschedule.
If there are no objections to the time, I'll update the channel shortly before disabling and re-enabling CD.

Half an hour before the scheduled time, follow the Deploy when GitHub is unavailable instructions, stopping at the “Deploy image to ArgoCD” step.
At the scheduled time, post a message on Slack to remind people that CD will be disabled shortly.
Continue the guide from the “Deploy image to ArgoCD” step.

Drill enabling a code freeze

Choose a continuously-deployed app where you can make a meaningful change to the default branch, such as fixing a typo or merging a Dependabot PR.

Before merging the change, implement a deployment freeze for that app.

View the application’s page in Argo CD in each environment to see whether a deployment happened or not.

Remove the code freeze, then make sure the current version is deployed to all three environments.

Drill making changes to user accounts

Assign a user to their publisher in data.gov.uk

Log into our shared data.gov.uk publisher account. Pick a publisher and do a hypothetical walkthrough of the Assign users to publishers instructions.

Change a user’s permissions in Signon

Carry out a hypothetical walkthrough of unsuspending a user and resetting a user’s 2FA.

Drill creating and changing redirects

Redirect a route

On Integration or Staging, follow the Removing a route created in the Short URL Manager and Removing a route completely so it can be replaced with another route instructions.

Change a slug and create a redirect

On Integration or Staging, follow the Change a slug and create redirect in Whitehall instructions, picking something at random in Whitehall from one of the groups of entities listed (people, roles, organisations, etc.).

Drill modifying a document’s change note

Modify and remove a document’s change note in Whitehall

On Integration or Staging, follow Modify a change note in Whitehall using this document or one of your choice. Once you have successfully updated the change note you can drill removing a change note in Whitehall.

Modify and remove a document’s change note in Content Publisher

On Integration or Staging, follow Modify a change note in Content Publisher using this document. The update dated 30 November 2021 has a bespoke change note which you could try changing: click “show all updates” at the bottom of the page.

You can also try deleting the change note. Again, ensure you do this on Staging or Integration.

Modify a document’s change note in Publishing API

On Integration or Staging, follow Modify a change note in Publishing API using this document or one of your choice. Once you have successfully updated the change note you can drill removing a change note in Publishing API.

Drill making changes to the homepage

Drill updating homepage popular links

Change the homepage popular links following Update popular links. Open a draft PR, and deploy your branch to integration. Once deployed, check your change and redeploy the previous branch to integration.

Drill updating homepage promotion slots

Follow the Update homepage promotion slots instructions, using an appropriate image and text. Open a draft PR, and deploy your branch to integration. Once deployed, check your change and redeploy the previous branch to integration.

Drill CDN failover

Warn in #govuk-2ndline-tech that you’re about to do this, because our failover CDN does not have full feature parity with our primary one.
Ensure that you are connected to the VPN before starting.
Follow the Fall back to AWS CloudFront instructions for staging only.
Check if GOV.UK Staging still works correctly after performing the failover.
Revert your changes when finished.