GraphQL
GraphQL is an API query language that is used by GOV.UK to serve some live traffic. Our implementation exists in Publishing API and is used by multiple frontend applications.
Traffic is allocated somewhat randomly to either Content Store or GraphQL, to allow us to compare response times between the two. The decision is made in the frontend applications based on a level of traffic that has been configured through environment variables for each schema (named GRAPHQL_RATE_SCHEMA_NAME).
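As a rough illustration of the allocation described above, the sketch below draws a random number per request and compares it with the configured rate. It is not the frontend applications' actual code. The helper name `use_graphql`, the assumption that the rate is a fraction between 0 and 1, and the way the variable name is derived from the schema name are all assumptions made for the example.

```python
import os
import random


def use_graphql(schema_name, env=None, rng=random.random):
    """Return True if this request should be served from GraphQL.

    Assumptions (not confirmed by the source): the rate is a fraction
    between 0 and 1, and the variable is GRAPHQL_RATE_<SCHEMA_NAME>
    with the schema name upper-cased.
    """
    env = os.environ if env is None else env
    raw = env.get(f"GRAPHQL_RATE_{schema_name.upper()}", "0")
    try:
        rate = float(raw)
    except ValueError:
        rate = 0.0  # treat an unparseable value as "no GraphQL traffic"
    # rng() is uniform in [0, 1), so a rate of 0 never selects GraphQL
    # and a rate of 1 always does.
    return rng() < rate
```

In this sketch an unset variable behaves the same as a value of zero, which matches the two ways of switching off a schema described later on this page.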
Monitoring GraphQL traffic
A Grafana dashboard exists in each environment to provide metrics on the traffic levels and response times of our GraphQL implementation:
Dealing with GraphQL alerts
Elevated GraphQL request duration
This alert is triggered when a certain request duration for GraphQL queries exceeds a defined level.
Look for changes in traffic to specific schemas
The queries for some schemas are known to execute faster than others, due to differing complexity between them. The average response times could be pushed upwards by a higher number of requests for a specific schema.
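A small worked example shows how a change in traffic mix alone can raise the overall average, even when no individual schema has become slower. The schema names and figures below are made up for illustration and are not measurements from GOV.UK.

```python
def overall_mean_ms(traffic):
    """Mean duration across schemas, weighted by request count.

    traffic maps a schema name to (request_count, mean_duration_ms).
    """
    total_requests = sum(count for count, _ in traffic.values())
    total_time = sum(count * mean for count, mean in traffic.values())
    return total_time / total_requests


# Hypothetical figures: per-schema durations are identical in both cases,
# only the share of requests going to the slower schema changes.
before = {"fast_schema": (900, 50.0), "slow_schema": (100, 200.0)}
after = {"fast_schema": (500, 50.0), "slow_schema": (500, 200.0)}

print(overall_mean_ms(before))  # 65.0
print(overall_mean_ms(after))   # 125.0
```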
The GraphQL dashboards include data broken down by schema, which could be useful in identifying changes in traffic to a specific schema.
It may be necessary to switch off traffic to the affected schema until its query is optimised.
Look for slow requests to specific URLs
Logit can be used to identify specific URLs that are performing slowly.
The following Kibana query will show all requests to the Publishing API read replica’s live_content endpoint:
kubernetes.labels.app_kubernetes_io/name:"publishing-api-read-replica" AND "/graphql/content/*"
Look at the logjson.request_time field to identify the total request duration. Sort this column in descending order to see the slowest queries.
A specific URL being slow could be caused by that specific page having a larger amount of data than others of the same schema. Check to see if other pages of the same schema are affected, then consider optimising the query. It may be necessary to switch off traffic until this is optimised.
Elevated GraphQL error rate
This alert is triggered when the rate of 200 responses from GraphQL queries falls below a defined level.
Note: GraphQL will return a 200 response code when the client makes an invalid query (e.g. requesting a field that does not exist). This is because there is no server error, rather the client has provided invalid data. See the documentation for more details on error responses.
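Since such responses still carry a 200 status, the status code alone does not show whether a query succeeded. Under the GraphQL specification, a response that encountered errors includes a top-level `errors` list, so a client can check for it as sketched below. The helper name `graphql_error_messages` and the sample payload are illustrative and are not taken from Publishing API.

```python
import json


def graphql_error_messages(body):
    """Return error messages from a GraphQL JSON response body.

    An empty list means the response has no top-level "errors" entry.
    """
    payload = json.loads(body)
    return [error.get("message", "") for error in payload.get("errors", [])]


# Illustrative body for a query that requested a field that does not exist.
sample = '{"errors": [{"message": "Field does not exist"}]}'
print(graphql_error_messages(sample))
```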
The GraphQL dashboards include error codes broken down by frontend application. This shows what type of error is occurring and whether it affects requests from one or many frontend applications.
The app request rates, errors and durations dashboard could be useful to see when the errors started to occur and their frequency.
If the errors are server-side application errors (i.e. 500 response codes), these are logged in Sentry. Navigate to the app-publishing-api project to examine any recent errors logged from the GraphqlController.
It may be necessary to switch off traffic until the error is fixed.
Switching off public GraphQL traffic
Specific schemas
If a specific schema is identified as being slow, it may be useful to switch off traffic to that schema until the query can be optimised. This is done by removing the environment variables from the relevant frontend application (example PR) or setting the value to zero.
All schemas
To switch off all public traffic, set the GRAPHQL_RATE_* environment variables to zero in the relevant environment, by updating govuk-helm-charts:
Changes to production values in govuk-helm-charts require a review from the Platform Engineering team, which can be requested via the #govuk-platform-engineering Slack channel.
Emergency procedure (e.g. out of hours)
If an urgent change is required and it is not practical to go through the normal Helm chart workflow, you can update the environment variables directly on the cluster for the affected frontend applications.
⚠️ This should only be used in emergencies. Any direct changes must be reconciled with govuk-helm-charts as soon as possible.
Step 1: Disable Argo CD auto-sync
- Open the Argo CD UI.
- Find the relevant frontend application.
- Go to Details.
- Untick ENABLE AUTO-SYNC.
- Ignore the AUTOMATED DISABLE AUTO-SYNC option.
- Ignore the sync status indicator at the top of the UI that says “Auto sync is enabled”.
This prevents Argo from immediately reverting your manual changes.
Step 2: Edit the deployment
Run the following command: kubectl edit deploy <frontend_app_name>
In the editor:
- Locate the GRAPHQL_RATE_* environment variables.
- Set all of them to "0".
- Save and exit.
The pod should restart automatically after the change.
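The edit in Step 2 amounts to the transformation sketched below on each container's `env` list, shown here on a plain Python structure rather than against a live cluster. The function name and the sample variable names are hypothetical. Only the `GRAPHQL_RATE_` prefix and the target value `"0"` come from the steps above.

```python
def zero_graphql_rates(env):
    """Return a copy of a container env list with GRAPHQL_RATE_* set to "0".

    env is a list of {"name": ..., "value": ...} dicts, the shape used
    for container environment variables in a Kubernetes Deployment.
    """
    updated = []
    for var in env:
        if var["name"].startswith("GRAPHQL_RATE_"):
            # Environment variable values are strings, hence "0" not 0.
            var = {**var, "value": "0"}
        updated.append(var)
    return updated


# Hypothetical variable names, for illustration only.
env = [
    {"name": "GRAPHQL_RATE_EXAMPLE_SCHEMA", "value": "0.5"},
    {"name": "SOME_OTHER_SETTING", "value": "abc"},
]
print(zero_graphql_rates(env))
```

Every other variable is left untouched, which is the property to check by eye when saving the edited deployment.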
Step 3: Verify rollout
In the Argo CD UI:
- Sync status will show as “OutOfSync”.
- App health will briefly appear as “Progressing” before changing to “Healthy”.
- Live manifest will reflect changes made in the cluster.
If the pods do not restart automatically, trigger a manual restart: kubectl rollout restart deploy <frontend_app_name>
After the incident (in-hours)
- Apply the same changes in the govuk-helm-charts repository via a pull request.
- Follow the standard review and approval process.
- Once the Helm chart changes are deployed, re-enable Auto-Sync in Argo CD for the affected applications.
This ensures the system returns to a fully managed and declarative state.