Raise issues with Reliability Engineering
When on 2nd line, you may experience an issue with GOV.UK that needs help from the Site Reliability Engineers (SREs) who work on GOV.UK infrastructure. The SREs previously worked in the RE GOV.UK team in Reliability Engineering, but they now work as part of the Replatforming team. It is best to use RE GOV.UK channels for communication.
If you require urgent assistance
GOV.UK SREs have a Slack channel: #govuk-2ndline. Posting in that channel will get their attention. The channel is also used for general queries, so say in your message if a problem is time critical.
You can also “Run a Play” from the page of an ongoing incident in PagerDuty. This automatically calls the RE engineer on duty, both in-hours and out-of-hours.
If you need to hand over a long-standing incident
- If this is in-hours: a Site Reliability Engineer from the RE GOV.UK team should take over the incident lead role. A 2nd line GOV.UK engineer will continue in the comms lead role.
- If this is out-of-hours: the primary GOV.UK engineer should be the incident lead. The secondary GOV.UK engineer should be the comms lead.
There is no longer an RE GOV.UK out-of-hours rota. GOV.UK engineers on the in-hours and out-of-hours rotas should have all the access and documentation required to address any issues.
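The handover rules above can be written down as a small lookup, which may be handy as a quick reference or in incident tooling. This is only an illustrative sketch: the `HandoverRoles` type and the `handover_roles` function are hypothetical names, not part of any GOV.UK tooling, and the role descriptions are copied from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandoverRoles:
    """Who holds each role after a long-standing incident is handed over."""
    incident_lead: str
    comms_lead: str


def handover_roles(in_hours: bool) -> HandoverRoles:
    """Return the role holders described in this section (hypothetical helper)."""
    if in_hours:
        # In-hours: an SRE takes over as lead, 2nd line keeps handling comms.
        return HandoverRoles(
            incident_lead="Site Reliability Engineer (RE GOV.UK team)",
            comms_lead="2nd line GOV.UK engineer",
        )
    # Out-of-hours: there is no RE GOV.UK rota, so GOV.UK engineers hold both roles.
    return HandoverRoles(
        incident_lead="primary GOV.UK engineer",
        comms_lead="secondary GOV.UK engineer",
    )


if __name__ == "__main__":
    print(handover_roles(in_hours=True))
    print(handover_roles(in_hours=False))
```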
If a problem is not urgent
If the issue you’ve identified seems like a non-urgent story, you can add it to the GOV.UK 2nd Line Trello board in the “Proposed stories for Platform Health” column. Platform Health will then decide whether to raise it with RE and manage the ticket through its life cycle, or to resolve the problem themselves.
Raising a Zendesk ticket with Reliability Engineering
The official way to communicate with Reliability Engineering is through Zendesk tickets.
To raise a ticket:
- Create a new ticket on Zendesk
- Enter yourself as the requester
- Set assignee to “3rd Line–GDS Reliability Engineering”
- Add the 2nd Line Delivery Manager as a CC recipient
- Fill in and submit the ticket
- Monitor and contribute to the ticket until it is resolved
Understanding what Reliability Engineering can assist with
There is a broad explanation of the different areas of support in GOV.UK in ask for help.
More specifically for GOV.UK, these are the things that fall under the responsibility of Reliability Engineering (RE):
- GOV.UK Puppet - RE are responsible for maintenance and evolution, but as GOV.UK also merge changes, GOV.UK can be responsible for problems too
- Upgrading software packages that are end-of-life, have security issues or are no longer fit for purpose
- Running and maintaining the Terraform configurations for AWS
- Backup software such as Duplicity
- Maintaining the mirror configuration
- Keeping the CI environment running - GOV.UK are responsible for job configuration
- Fabric scripts