data.gov.uk: high traffic rate alert

Last updated: 23 Mar 2026

This alert fires when the origin request rate to data.gov.uk exceeds 44.6 req/s sustained for 5 minutes. This threshold is 80% of the peak capacity (55.8 req/s) established by load testing with 100 virtual users on the staging environment (March 2026).
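The threshold arithmetic can be reproduced as a quick sanity check, for example if the load test is re-run and the peak figure changes. This is a minimal sketch; the constant and function names are hypothetical and are not taken from the alert configuration.

```python
# Illustrative only: these names are hypothetical, not part of the alert config.
PEAK_REQ_PER_S = 55.8  # peak throughput from the March 2026 load test
ALERT_FRACTION = 0.80  # the alert fires at 80% of the tested peak


def alert_threshold(peak: float = PEAK_REQ_PER_S,
                    fraction: float = ALERT_FRACTION) -> float:
    """Return the alert threshold in req/s, rounded to one decimal place."""
    return round(peak * fraction, 1)


def fraction_of_peak(current_rate: float, peak: float = PEAK_REQ_PER_S) -> float:
    """Return the current origin rate as a fraction of the tested peak."""
    return current_rate / peak


if __name__ == "__main__":
    print(alert_threshold())                 # 44.6
    print(round(fraction_of_peak(50.0), 2))  # 0.9
```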

Understanding the alert

The alert expression is:

sum(rate(fastly_rt_origin_fetches_total{service_name=~".*data.gov.uk"}[5m])) > 44.6

This measures origin requests: requests that reached the data.gov.uk infrastructure after passing through the Fastly CDN. Cached responses (homepage, static assets) do not count. A sustained rate above 44.6 req/s means the site is using up the remaining 20% headroom below the load-tested limit and may be approaching saturation.
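To see the raw rate rather than only whether the alert is firing, the same expression can be evaluated without the `> 44.6` comparison. The sketch below assumes the metric is reachable through a Prometheus-compatible HTTP API; the base URL is a placeholder you would need to supply, authentication is not handled, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# The alert expression without the "> 44.6" comparison, so the raw rate is returned.
RATE_EXPR = 'sum(rate(fastly_rt_origin_fetches_total{service_name=~".*data.gov.uk"}[5m]))'


def build_query_url(base_url: str, expr: str = RATE_EXPR) -> str:
    """Build a Prometheus instant-query URL (/api/v1/query) for the expression."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_rate(payload: dict) -> Optional[float]:
    """Extract the single value from an instant-query response, or None if empty."""
    result = payload.get("data", {}).get("result", [])
    if not result:
        return None
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    return float(result[0]["value"][1])


def current_origin_rate(base_url: str) -> Optional[float]:
    """Fetch the current origin request rate in req/s (base_url is an assumption)."""
    with urllib.request.urlopen(build_query_url(base_url), timeout=10) as resp:
        return parse_rate(json.load(resp))
```

`parse_rate` is kept separate from the network call so it can be checked against a saved response.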

The load test baseline for reference:

Metric	Value
Peak load tested	100 virtual users
Peak throughput	55.8 req/s
Error rate at peak	0.00%
Find replicas (tested config)	3 × 4Gi, RAILS_MAX_THREADS=16

Impact

At this traffic level the site may still be healthy; the alert is an early warning. If traffic continues to grow towards the load-tested peak of 55.8 req/s without scaling, you can expect:

Find (Rails/Puma) pod latency to increase as Puma thread pools fill up
Potential pod OOMKills if traffic generates large Ruby heap allocations
504 Gateway Timeout errors at the ALB if pods become unresponsive

Checking the current state

1. Check Grafana

Open the data.gov.uk app requests dashboard and confirm:

Is the request rate genuinely elevated, or is this a brief spike?
Are error rates (4xx/5xx) increasing alongside the traffic?
Is the rate still climbing, or has it plateaued?

2. Check pod health

gds aws govuk-production-dguengineer -- \
  kubectl get pods -n datagovuk -l app=datagovuk-find

Look for pods in CrashLoopBackOff, pods terminated as OOMKilled (exit code 137), or pods with a high restart count. If pods are already crashing, scale up immediately before investigating further.
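When there are many pods, the same checks can be applied to the JSON form of the pod list (the command above with `-o json` appended). This is a sketch only: the function name is hypothetical, and the restart threshold of 3 is an assumption, since this runbook does not define what counts as "high".

```python
RESTART_THRESHOLD = 3  # assumption: "high restart count" is not defined in this runbook


def unhealthy_pods(pod_list: dict, restart_threshold: int = RESTART_THRESHOLD) -> list:
    """Return (pod name, reason) pairs for pods matching the checks above.

    pod_list is the parsed output of `kubectl get pods ... -o json`.
    """
    findings = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            last = cs.get("lastState", {}).get("terminated") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                findings.append((name, "CrashLoopBackOff"))
            if last.get("reason") == "OOMKilled":
                findings.append((name, "OOMKilled"))
            elif last.get("exitCode") == 137:
                # 137 means the container was killed with SIGKILL; often, not always, OOM.
                findings.append((name, "exit code 137"))
            restarts = cs.get("restartCount", 0)
            if restarts >= restart_threshold:
                findings.append((name, f"{restarts} restarts"))
    return findings
```

For example, save the JSON output to a file, load it with `json.load`, and pass the result to `unhealthy_pods`; an empty list means none of the three conditions matched.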

3. Check current replica count

gds aws govuk-production-dguengineer -- \
  kubectl get deployment -n datagovuk datagovuk-find \
  -o jsonpath='{.spec.replicas}'

Normal baseline: Find = 3 replicas.

Scaling up to handle +10 req/s

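The primary scaling lever is the Find replica count. Adding one replica spreads load across an additional Puma instance and adds roughly 18 req/s of capacity: the load test reached 55.8 req/s on 3 replicas (with RAILS_MAX_THREADS=16), or about 18.6 req/s per replica, assuming throughput scales linearly with replica count.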
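A rough way to size the replica count for an expected rate is sketched below. It assumes throughput scales linearly with replica count (the load test figures in this runbook cover only 3 replicas) and keeps the same 80% margin the alert threshold uses. The function and constant names are hypothetical.

```python
import math

TESTED_PEAK_REQ_PER_S = 55.8  # load test peak with 3 Find replicas
TESTED_REPLICAS = 3           # also the production baseline
TARGET_UTILISATION = 0.80     # same 80% margin as the alert threshold


def replicas_needed(expected_rate: float,
                    target_utilisation: float = TARGET_UTILISATION) -> int:
    """Estimate the Find replicas needed to serve expected_rate with headroom.

    Assumes linear scaling per replica; never returns fewer than the baseline.
    """
    per_replica = TESTED_PEAK_REQ_PER_S / TESTED_REPLICAS  # about 18.6 req/s
    estimate = math.ceil(expected_rate / (per_replica * target_utilisation))
    return max(TESTED_REPLICAS, estimate)
```

For a rate 10 req/s above the threshold (54.6 req/s), this gives 4 replicas, which matches the 3 → 4 step described here.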

Step 1: Increase Find replicas (3 → 4)

Edit charts/app-of-apps/values-production.yaml in the govuk-dgu-charts repository:

datagovukHelmValues:
  find:
    replicaCount: 4   # increased from 3

Create a PR, get it reviewed, and merge to main. ArgoCD will automatically roll out the change within a few minutes.

Step 2: Verify the rollout

gds aws govuk-production-dguengineer -- \
  kubectl rollout status deployment/datagovuk-find -n datagovuk

Once rollout is complete, monitor the Grafana dashboard for 10–15 minutes to confirm:

Request rate is being handled without increasing error rates
No new pod restarts or OOMKills
p95 response time for Find endpoints remains healthy

Additional option: increase Puma thread count

If the constraint is per-pod health (liveness probe failures, Puma thread pool exhaustion) rather than the number of pods, also set:

datagovukHelmValues:
  find:
    config:
      railsMaxThreads: "16"   # default is 5

This was required during load testing at 100 VUs. Include it alongside the replica increase if you are seeing probe failures under load. Note: higher thread counts increase per-pod memory usage — verify pods stay within the 4Gi limit after deploying.
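One way to check memory after the change is to compare the output of `kubectl top pods` (which needs metrics-server in the cluster) against the 4Gi limit. The sketch below parses that text output; the 90% warning fraction is an assumption, the function names are hypothetical, and only the Ki, Mi and Gi suffixes and plain byte counts are handled.

```python
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def to_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '3100Mi' or '4Gi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count


def pods_near_limit(top_output: str, limit: str = "4Gi", fraction: float = 0.9) -> list:
    """Return names of pods whose memory usage is at or above fraction * limit.

    Expects the text printed by `kubectl top pods` (columns: NAME, CPU, MEMORY).
    """
    ceiling = to_bytes(limit) * fraction
    flagged = []
    for line in top_output.splitlines():
        cols = line.split()
        if len(cols) < 3 or cols[0] == "NAME":
            continue  # skip blank lines and the header row
        if to_bytes(cols[2]) >= ceiling:
            flagged.append(cols[0])
    return flagged
```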

Scaling back to normal

Once the DataGovUkHighTrafficRate alert has resolved and traffic has returned to normal levels (below 44.6 req/s for at least 30 minutes), revert to baseline sizing to release cluster resources.
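The "below 44.6 req/s for at least 30 minutes" condition can be checked against a series of rate samples, for instance the `values` list from a Prometheus range query over the alert expression. This is a sketch under that assumption; the function name is hypothetical.

```python
THRESHOLD_REQ_PER_S = 44.6
QUIET_PERIOD_S = 30 * 60  # 30 minutes


def safe_to_scale_down(samples: list,
                       threshold: float = THRESHOLD_REQ_PER_S,
                       quiet_period_s: int = QUIET_PERIOD_S) -> bool:
    """Return True if every sample in the last quiet_period_s is below threshold.

    samples is a time-ordered list of [unix_timestamp, "value"] pairs, the
    shape Prometheus uses for the "values" field of a range-query result.
    """
    if not samples:
        return False
    newest = float(samples[-1][0])
    oldest = float(samples[0][0])
    if newest - oldest < quiet_period_s:
        return False  # not enough history to cover the quiet period
    recent = [float(v) for ts, v in samples if float(ts) >= newest - quiet_period_s]
    return all(v < threshold for v in recent)
```

The function returns False when the history is shorter than the quiet period, so a short query window cannot produce a false "safe" result.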

Revert charts/app-of-apps/values-production.yaml to the baseline:

datagovukHelmValues:
  find:
    replicaCount: 3   # back to baseline
    # remove railsMaxThreads if it was added

Create a PR with the description “Revert data.gov.uk scaling after high traffic alert resolved”, merge to main, and verify the rollout completes without errors.

Check that the alert does not re-fire within 15 minutes of scaling back.

Baseline production configuration (for reference)

Component	Setting	Baseline value
Find replicas	`find.replicaCount`	3
Find memory limit	`find.appResources.limits.memory`	4Gi
Find memory request	`find.appResources.requests.memory`	2Gi

These values are defined in charts/app-of-apps/values-production.yaml in the govuk-dgu-charts repository.