Table of contents

Raise issues with Reliability Engineering

Reliability Engineering are a programme in GDS, they are responsible for the infrastructure that most GDS software runs on and the underlying network configuration. They also provide shared software services used by various GDS programmes such as logging and monitoring tools.

When on 2nd line you may experience an issue with GOV.UK that requires asking the RE GOV.UK team in Reliability Engineering for assistance.

There are Reliability Engineering docs for users of their systems. There are also other Reliability Engineering docs for use by the team, these may contain more technical details.

If you require urgent assistance

The RE GOV.UK have a Slack channel - #re-govuk - and they have an assigned interruptible person. By posting in that channel you can get their attention. This channel can be used for general queries too so do indicate in your message that a problem is time critical.

Failing slack communication you can also walk over to the RE GOV.UK team desks and talk to the interruptible person directly - they are currently on the 6th floor near bank 27-28.

You may be advised to create a Zendesk ticket.

If you need to handover a long-standing incident

If this is in-hours: A Site Reliabilty Engineer from the RE GOV.UK team should take over the incident lead role. A 2nd line GOV.UK engineer will continue the comms lead role.

If this is out-of-hours: The primary GOV.UK engineer should be the incident lead and support the RE GOV.UK out of hours engineer to fix the problem. The secondary GOV.UK engineer should be the comms lead.

To escalate an incident to the RE GOV.UK engineer via PagerDuty:

  1. Click on the incident in PagerDuty
  2. Click on ‘Run a play’ to select ‘call the RE GOV.UK on-call person’

If a problem is not urgent

You can use the #reliability-eng slack channel for advice. If the issue you’ve identified seems like a non-urgent story you can add it the GOV.UK 2nd Line trello board in the “Proposed stories for Platform Health” column. Platform Health will then decide whether to raise this with RE, and manage the ticket through its life cycle, or to resolve this problem themselves.

Raising a Zendesk ticket with Reliability Engineering

The official way to communicate with Reliability Engineering is through Zendesk tickets.

To raise a ticket:

  1. Create a new ticket on Zendesk
  2. Enter yourself as the requester
  3. Set assignee to “3rd Line–GDS Reliability Engineering”
  4. Add the 2nd Line Delivery Manager as a CC recipient
  5. Fill in and submit ticket
  6. Monitor and contribute to the ticket until it is resolved

Understanding what Reliability Engineering can assist with

There is a broad explanation of the different areas of support in GOV.UK in who do I ask for support?.

More specificially to GOV.UK these are things that fall under the responsibility of Reliability Engineering (RE):

  • GOV.UK Puppet - RE are responsible for maintenance and evolution, but as GOV.UK merge changes they can too be responsible for problems
  • Upgrading software packages that are end-of-life/have security issues/no longer fit for purpose
  • Running and maintaining the Terraform configurations for AWS;
  • Backup software such as Duplicity
  • Maintaining the mirror configuration
  • Keeping the CI environment running - GOV.UK are responsible for job configuration
  • Fabric scripts
This page was last reviewed on 13 May 2019. It needs to be reviewed again on 13 August 2019 by the page owner #govuk-2ndline .
This page was set to be reviewed before 13 August 2019 by the page owner #govuk-2ndline. This might mean the content is out of date.