Raise issues with Reliability Engineering
When on 2nd line, you may encounter an issue with GOV.UK that needs help from the Site Reliability Engineers (SREs) who work on GOV.UK infrastructure. The SREs previously worked in the RE GOV.UK team in Reliability Engineering, but they currently work as part of the Replatforming team. It is best to use the RE GOV.UK channels to communicate with them.
If you require urgent assistance
Check the On call schedule for GOV.UK SRE in-hours to find out who is on the rota.
It is also possible to “Run a Play” in the context of an ongoing incident page in PagerDuty. This will automatically call the RE engineer on duty both in- and out-of-hours.
If you need to hand over a long-standing incident
- In-hours: a Site Reliability Engineer should take over the incident lead role. A 2nd line GOV.UK engineer will continue in the comms lead role.
- Out-of-hours: the primary GOV.UK engineer should be the incident lead. The secondary GOV.UK engineer should be the comms lead.
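The role assignments above can be summarised as a small lookup. This is only an illustrative sketch: the `IncidentRoles` type and `handover_roles` function are hypothetical names, not part of any GOV.UK tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentRoles:
    """Who holds each role after a long-standing incident is handed over."""
    incident_lead: str
    comms_lead: str


def handover_roles(in_hours: bool) -> IncidentRoles:
    """Return the role holders described in this section (hypothetical helper)."""
    if in_hours:
        # In-hours: an SRE takes over as lead, 2nd line keeps the comms role.
        return IncidentRoles(
            incident_lead="Site Reliability Engineer",
            comms_lead="2nd line GOV.UK engineer",
        )
    # Out-of-hours: both roles stay with the GOV.UK on-call engineers.
    return IncidentRoles(
        incident_lead="primary GOV.UK engineer",
        comms_lead="secondary GOV.UK engineer",
    )
```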
There is no longer an RE GOV.UK out-of-hours rota. GOV.UK engineers on the in-hours and out-of-hours rotas should have all the access and documentation required to address any issues.
If a problem is not urgent
If the issue you’ve identified seems like a non-urgent story, you can add it to the GOV.UK 2nd Line Trello board in the “Ongoing issues to be aware of & unexplained events” column. The 2nd line tech lead(s) will then decide whether to pass it on to another team, manage the ticket through its life cycle, or resolve the problem themselves.
Raising a Zendesk ticket with Reliability Engineering
The official way to communicate with Reliability Engineering is through Zendesk tickets.
To raise a ticket:
- Create a new ticket on Zendesk
- Enter yourself as the requester
- Set assignee to “3rd Line–GDS Reliability Engineering”
- Add the 2nd Line Delivery Manager as a CC recipient
- Fill in and submit the ticket
- Monitor and contribute to the ticket until it is resolved
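If you ever needed to script these steps rather than use the Zendesk web form, the ticket fields might map onto a payload like the one sketched below. This is an assumption-heavy illustration, not a documented GOV.UK process: the field names (`requester`, `group_id`, `email_ccs`) follow our understanding of the Zendesk Tickets API and should be checked against the Zendesk documentation. The `build_ticket_payload` function, the email addresses, and the numeric group ID are all hypothetical; the real ID of the “3rd Line–GDS Reliability Engineering” group would have to be looked up in Zendesk.

```python
import json


def build_ticket_payload(requester_name, requester_email,
                         delivery_manager_email, group_id,
                         subject, description):
    """Build a ticket-creation payload mirroring the manual steps above.

    Hypothetical helper: it only builds a dictionary and sends nothing.
    """
    return {
        "ticket": {
            "subject": subject,
            "comment": {"body": description},
            # Step: enter yourself as the requester.
            "requester": {"name": requester_name, "email": requester_email},
            # Step: assign to the Reliability Engineering group
            # (the numeric ID is an assumption and must be looked up).
            "group_id": group_id,
            # Step: add the 2nd Line Delivery Manager as a CC recipient.
            "email_ccs": [
                {"user_email": delivery_manager_email, "action": "put"}
            ],
        }
    }


if __name__ == "__main__":
    payload = build_ticket_payload(
        requester_name="Example Engineer",
        requester_email="engineer@example.com",
        delivery_manager_email="delivery.manager@example.com",
        group_id=123,  # placeholder, not a real group ID
        subject="Example: issue needing Reliability Engineering help",
        description="What happened, when, and what has been tried so far.",
    )
    print(json.dumps(payload, indent=2))
```

Building the payload separately from sending it keeps the sketch free of credentials and network calls. The final step, monitoring and contributing to the ticket until it is resolved, remains a manual task.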
Understanding what Reliability Engineering can assist with
There is a broad explanation of the different areas of support in GOV.UK in ask for help.
More specifically for GOV.UK, these are the things that fall under the responsibility of Reliability Engineering (RE):
- GOV.UK Puppet - RE are responsible for maintenance and evolution, but because GOV.UK also merge changes, GOV.UK can be responsible for problems too
- Upgrading software packages that are end-of-life, have security issues, or are no longer fit for purpose
- Running and maintaining the Terraform configurations for AWS
- Maintaining the mirror configuration
- Keeping the CI environment running - GOV.UK are responsible for job configuration