GOV.UK Chat Alerts
Note
For any queries or assistance regarding Chat Alerts, please post in the #dev-notifications-ai-govuk Slack channel
Alerts by Priority
For any alert, it is worth checking the Grafana GOV.UK Chat Technical dashboards to see if any part of the service is showing signs of an issue:
The Grafana dashboards show usage of:
- Application rate limits
- Bedrock tokens usage
- Bedrock invocations
- Bedrock Service Exceptions
- HTTP requests
- Pod CPU & Memory
- Opensearch cluster
- RDS Postgres
- Elasticache Redis
- Sidekiq queues
Pagerduty Alerts
High5xxRate
The High5xxRate alert is triggered when 10% of total requests return a 5xx for more than 5 minutes. The 503 error code has been excluded from being alerted on, as that will be used when the Chat service is disabled, so we do not want that to go to Pagerduty.
Slack Alerts
LongRequestDuration
The LongRequestDuration alert indicates that a backend part of the service is not responding as expected, and as a result the user experience is severely impacted. For example, we have seen it fire when the RDS Postgres Database is offline.
HighPod* alerts
The following alerts relate to EKS Pod performance:
Horizontal Pod Autoscaling (HPA) has been employed, so along with the rate limiting in place, we should not see these trigger but if they do, it could indicate an issue with the EKS cluster. HPA configuration can be found in the relevant values-{environment}.yaml files here.
SidekiqAnswerQueueLength
The SidekiqAnswerQueueLength is used as the metric to scale the worker pods, so an increase in this will likely be down to high load. If the age of the oldest job in the Sidekiq answer queue is increasing, it may be a process has got stuck.
SidekiqAnswerJobAge
The SidekiqAnswerJobAge alert indicates that a job has been in the answer queue for 15 minutes. We should process jobs from this queue immediately, as any delay will significantly degrade the user experience. If this alert fires, we should ensure that autoscaling is working as expected and jobs are being processed.
SidekiqDefaultJobAge
The SidekiqDefaultJobAge alert indicates that a Sidekiq job has been in the default queue for over 6 hours. These jobs are not time-sensitive, so our tolerance for delaying their processing is higher. However, if they have not been processed in 6 hours, there is either an issue with jobs not being processed or a concerningly large backlog, and we should investigate.
Bedrock token threshold alerts
The govuk-chat-bedrock-token-threshold-50 and govuk-chat-bedrock-token-threshold-100 alerts are for us to use as data for deciding if higher limits need to be requested of AWS. These alerts are configured in AWS Cloudwatch Alarms, rather than Prometheus like the other alerts.
The dashboard graph showing Bedrock Service Exceptions is an indication of rate limiting or problems on the AWS side. ValidationException happens when the invoke request has too many input tokens and ServiceUnavailableException is a result of the model being throttled by AWS.
To get the error message relating to the Cloudwatch error code for the exceptions, the lookup attribute User name can be used to filter events in Cloudtrail.
ElevatedAnswerErrorStatuses
The ElevatedAnswerErrorStatuses alert fires when three or more answers generated in the last 20 minutes have an error status.
To determine what the the issue is you should naviate to the questions page in the admin console for the relevant environment (this link is for production). Then view each questions with an error status and navigate to to the error message row.