Last updated: 25 Sep 2024

GOV.UK Chat Alerts

Note

For any queries or assistance regarding Chat Alerts, please post in the #dev-notifications-ai-govuk Slack channel.

Alerts by Priority

For any alert, check the Grafana GOV.UK Chat Technical Dashboards to see whether any part of the service is showing signs of an issue.

PagerDuty Alerts

  • High5xxRate

The 503 error code is excluded from this alert. The Chat service returns 503 when it has been deliberately disabled, and that should not page anyone through PagerDuty.
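The exclusion can be illustrated with a minimal Python sketch. It is not the actual alert rule, which is defined elsewhere and is not shown on this page. The function name `alertable_5xx_rate` and the sample status codes are assumptions made for illustration only.

```python
def alertable_5xx_rate(status_codes, excluded=frozenset({503})):
    """Return the fraction of responses that are 5xx, ignoring excluded codes.

    Hypothetical helper: 503 is excluded by default because the Chat
    service returns it when deliberately disabled.
    """
    total = len(status_codes)
    if total == 0:
        return 0.0
    errors = sum(
        1 for code in status_codes
        if 500 <= code <= 599 and code not in excluded
    )
    return errors / total


# Illustrative sample: one genuine server error and two "service disabled" responses.
sample = [200, 500, 503, 503]
rate = alertable_5xx_rate(sample)
```

With this sample, only the single 500 response counts towards the rate. The two 503 responses are still part of the total but are not treated as errors.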

Slack Alerts

  • LongRequestDuration
  • HighPodCPUFE
  • HighPodCPUWorker
  • HighPodMemoryFE
  • HighPodMemoryWorker
  • NearTokensRateLimit
  • NearRequestsRateLimit

The LongRequestDuration alert indicates that a backend part of the service is not responding as expected, and as a result the user experience is severely impacted. For example, we have seen it fire when the RDS Postgres Database is offline.

Any of the HighPod alerts (HighPodCPUFE, HighPodCPUWorker, HighPodMemoryFE, HighPodMemoryWorker) usually indicates that the service is under heavy load. In this situation, it may be necessary to increase the number of relevant pods by modifying replicaCount (for FE pods) or workerReplicaCount (for Worker pods) in the govuk-chat section of the relevant Helm chart values file found here.

The NearTokensRateLimit and NearRequestsRateLimit alerts indicate that over 80% of the OpenAI limit for tokens or requests per minute has been used.
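As a rough illustration of the 80% threshold, the check behind these alerts can be sketched in Python. The function name `near_rate_limit` and the example figures are hypothetical. The real limits depend on the OpenAI account, and the real alert expression is not shown on this page.

```python
def near_rate_limit(used, limit, threshold=0.8):
    """Return True when usage is over `threshold` of the per-minute limit.

    Hypothetical helper: `used` and `limit` are tokens or requests per
    minute; 0.8 mirrors the 80% figure described above.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return used / limit > threshold


# Assumed example figures, not real account limits.
requests_alert = near_rate_limit(used=8_500, limit=10_000)
tokens_alert = near_rate_limit(used=100_000, limit=200_000)
```

Here `requests_alert` is true because 85% of the assumed request limit has been used, while `tokens_alert` is false at 50% of the assumed token limit.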