Out of hours support (on-call)
See “So, you’re having an incident!” for what to do during an incident.
This page covers how to prepare for incident handling before you go on call.
GOV.UK developers are part of an on-call rota to keep GOV.UK running at night, on the weekends and on public holidays.
Developers are added to the on-call rota after:
- they have completed 2 shadow 2nd line shifts
- they have completed 2 in-hours Secondary 2nd line shifts
They will be automatically opted into the rota at the same time as they join the Primary 2nd line rota. If a developer is unable to be on the on-call rota, they need to request to opt out by contacting their tech leads. This will only be granted if they have a strong reason to be exempt (e.g. health issues, caring responsibilities).
On call checklist
You should do these things before going on call so you’re prepared.
- Have the numbers of other people on your shift saved in your phone. This includes whoever is on Escalations. Get these numbers from PagerDuty.
- Make sure you know how to contact the rest of SMT, if the on-call SMT is unavailable.
- Ensure you have an up to date local copy of the Developer Docs repository and that you can build it.
- Make sure you can use ssh to access machines
- Make sure you can access AWS (using the web console and the
- Make sure you can access GCP (using the web console and the
- Make sure you can VPN to the office or disaster recovery location.
- Ensure your PagerDuty alert settings will wake you if you’re called. You might want to install the PagerDuty App on your phone and send a test notification.
- Ensure you can decrypt secrets with your GPG setup.
- Ensure you have single-sign-on set up for GOV.UK PaaS (instructions on setting up single-sign-on)
- Ensure you can access the govuk_development organisation in GOV.UK PaaS from the command line (instructions for setting up the Cloud Foundry command line).
- Read these documents:
The steps above are outlined in the On call template Trello card, which developers should drill when given Production Admin access. Developers should speak to the 2nd line tech lead(s) if they have any issues with the above steps.
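Some of the checklist items above depend on command line tools being installed. As a rough aid, the sketch below checks that a few executables can be found on your PATH. The tool names (`ssh`, `gpg`, `cf`) and the helper `missing_tools` are assumptions made for illustration, not part of the official checklist, and finding a tool on your PATH does not prove that your credentials or access actually work.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names for some checklist items; adjust for your machine.
REQUIRED_TOOLS = ("ssh", "gpg", "cf")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

You still need to work through each checklist item by hand, for example by actually connecting over ssh and decrypting a secret.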
You may also want to turn on notifications for every new message in the #govuk-incident Slack channel, but this is strictly optional. People should not expect to be able to contact you on Slack during your shift. You can change your Slack notification settings by clicking “Change notifications” and selecting “All new messages”.
Things that may result in you being contacted
We use PagerDuty for automated monitoring. You can update your notification rules in your PagerDuty account to notify you however you want (phone call, SMS, email, push notification). There are 2 ways that this might contact you:
Any Icinga checks that use govuk_urgent_priority will cause PagerDuty to be notified:
- Travel advice emails not going out
- Overdue publications in Whitehall
- Scheduled publications in Whitehall not queued
- High nginx 5xx rate for www-origin on cache machines
- varnishd port not responding
You can get the most up to date list of these by searching the Puppet repo for govuk_urgent_priority.
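If you have a local checkout of the Puppet repo, the search can be scripted. The following is a minimal sketch, assuming the repo is cloned to a directory you pass in; the function name `find_urgent_checks` is hypothetical and the script simply looks for the literal string in every readable text file.

```python
from pathlib import Path
from typing import List, Tuple

MARKER = "govuk_urgent_priority"


def find_urgent_checks(repo_root: str, marker: str = MARKER) -> List[Tuple[str, int, str]]:
    """Return (relative path, line number, stripped line) for each matching line."""
    root = Path(repo_root)
    matches = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip directories and anything inside the .git directory.
        if not path.is_file() or ".git" in relative.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if marker in line:
                matches.append((str(relative), number, line.strip()))
    return matches
```

Calling `find_urgent_checks("path/to/your/puppet/checkout")` returns one tuple per matching line, which you can print or scan to see which checks are configured to page.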
We have downtime checks configured in Pingdom which notify PagerDuty directly rather than using GOV.UK’s internal monitoring. They are all configured in Pingdom to:
- be considered down after 30 seconds
- use a check interval of 1 minute
- send an alert after 5 minutes
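To get a feel for how these settings interact, here is a simplified model of when an alert would be sent. It is only a sketch under stated assumptions: checks run exactly once per interval, a check that gets no response within 30 seconds is recorded as failed, and "alert after 5 minutes" is taken to mean 5 minutes of consecutive failed checks measured from the first failure. Pingdom’s real behaviour may differ, and `alert_time` is a hypothetical helper.

```python
from typing import Optional, Sequence

CHECK_INTERVAL_SECONDS = 60    # "a check interval of 1 minute"
ALERT_AFTER_SECONDS = 5 * 60   # "send an alert after 5 minutes"


def alert_time(
    results: Sequence[bool],
    interval: int = CHECK_INTERVAL_SECONDS,
    alert_after: int = ALERT_AFTER_SECONDS,
) -> Optional[int]:
    """Return seconds from the first check until an alert is sent, or None.

    results[i] is True if the check at time i * interval succeeded.
    """
    down_since = None
    for index, ok in enumerate(results):
        now = index * interval
        if ok:
            down_since = None  # any successful check resets the outage
            continue
        if down_since is None:
            down_since = now
        if now - down_since >= alert_after:
            return now
    return None
```

Under this model a short blip of a few failed checks does not page anyone, while a sustained outage pages roughly 5 minutes after the first failed check.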
They are useful when network access to all machines running GOV.UK is down. These are set up for key parts of the website such as:
- Assets (assets.publishing.service.gov.uk)
- Bouncer canary
- GOV.UK homepage
- S3 mirror (London) and S3 Mirror Replica (Ireland)
Phone calls from people
Senior members of GOV.UK may phone you if they’ve been contacted by other parts of government. These phone calls will generally come from the group that is on the rota for the ‘Escalations’ contact number.
You may be asked to deploy an emergency banner. If so, the GOV.UK on-call escalations contact will call you to carry this out. See the deploy an emergency banner doc for more information.
Updating the homepage
You might be asked to update the homepage promotion slots to highlight important information on GOV.UK.
Responding to being contacted
If you’re available to investigate the problem, acknowledge the alert in PagerDuty to prevent the next person being phoned.
Try to diagnose what the problem is. If you’re comfortable that you understand the problem there’s no need to escalate to the next person. If you’re not sure you completely understand what’s going on, it’s better to escalate the alert in PagerDuty.
If you escalate a problem, stay online to support the other person and to increase your understanding of what’s going on. If a problem is escalated to you, explain what you’re doing to the person who escalated to you.
If the technical people on-call don’t understand what’s going on, the final escalation will be to a senior member of GOV.UK who can make a decision about how serious the problem is and contact other people on GOV.UK if required.
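The effect of acknowledging an alert can be pictured with a small model of an escalation chain. This is an illustrative sketch only: the role names and their order are assumptions, the function `page_until_acknowledged` is hypothetical, and it does not call PagerDuty or reflect its actual timeouts.

```python
from typing import Callable, List, Sequence


def page_until_acknowledged(
    chain: Sequence[str],
    acknowledges: Callable[[str], bool],
) -> List[str]:
    """Return the people paged, in order, stopping at the first acknowledgement."""
    paged = []
    for person in chain:
        paged.append(person)
        if acknowledges(person):
            break  # acknowledging stops the next person being phoned
    return paged


# Assumed chain for illustration; the real order is defined in PagerDuty.
EXAMPLE_CHAIN = ("primary", "secondary", "escalations")
```

For example, `page_until_acknowledged(EXAMPLE_CHAIN, lambda person: person == "secondary")` returns `["primary", "secondary"]`: the escalations contact is never paged because the alert was acknowledged before it reached them.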
Phone calls from people
If you’re phoned by somebody who works on GOV.UK it’s likely that this is because:
- There’s a serious issue with the site which somebody else in government has noticed
- Government has decided to do emergency publishing
There’s a separate process for urgent changes to content which doesn’t require technical support (assuming everything is working).
On call charter
- Be available to be phoned in the evenings and at weekends
- Be able to be online to start investigating a problem within half an hour of being notified about it
- Don’t worry if you’re not able to answer the phone immediately - that’s why we have more than one person on-call
- Nobody is expected to understand every part of GOV.UK - you don’t need to know how to fix every issue on your own
- Logs are not as important as being available - if you need to lose some logs in order to bring the site back up, that’s probably a good trade-off to make
- Get paid. Make sure you submit your payment claim form after your shift.