govuk-ai-accelerator: 001. Use a PostgreSQL-backed queue for ontology generation
Date: 2026-06-24
Status: Accepted
Documents decision implemented: 2026-04-02, then hardened on 2026-04-08 and 2026-06-01.
Context
Ontology generation is a long-running operation. The /ontology/submit endpoint
needs to accept a request quickly, return a job ID, and let users track the job
from the UI while the Generator works in the background. A single HTTP request
must not stay open for the whole generation run.
The Workflow app is also deployed in an environment that can run more than one
pod. Queueing therefore has to coordinate workers across pods, survive ordinary
pod restarts, and avoid two pods processing the same ontology request at the
same time. The existing processing_job table already held user-visible job
state, so the queueing mechanism needed to keep that state consistent with the
Jobs and Review Ontologies pages.
The Generator pipeline also has shared process-level state in dependencies such
as logging, stdout handling, caches, and fsspec. Earlier multi-worker attempts
showed that unconstrained in-process parallelism could produce duplicate runs or
jobs that became stuck. The queueing design therefore had to prefer correctness
and operational simplicity over maximum throughput.
This ADR covers ontology generation jobs and ontology harness jobs dispatched
through scripts/pipeline/task_manager.py. Ingestion jobs use the same
ProcessingJob model for status tracking, but their direct executor submission
path is outside this decision.
Decision
Use PostgreSQL as the durable queue by persisting each ontology generation
request as a processing_job row with status = "pending".
Each app instance starts a task manager thread unless DISABLE_TASK_MANAGER=true
is set. The task manager polls for the oldest pending job, claims it with
SELECT ... FOR UPDATE SKIP LOCKED, and changes the row to running in the
same transaction. Claiming records:
- claimed_by, identifying the pod or host that owns the attempt;
- claimed_at, recording when the lease began;
- last_progress_at, recording recent Generator progress;
- attempt_count, incremented on every claim and used as a fencing token.
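The claim step described above can be sketched as one statement that selects and updates the row together. This is an illustration only, not the code in scripts/pipeline/task_manager.py: the function name claim_next_job, the id and created_at columns, and the psycopg-style %(name)s parameters are assumptions.

```python
# Hypothetical claim query: the column names id and created_at are assumed.
CLAIM_SQL = """
UPDATE processing_job
SET status = 'running',
    claimed_by = %(pod_id)s,
    claimed_at = now(),
    last_progress_at = now(),
    attempt_count = attempt_count + 1
WHERE id = (
    SELECT id
    FROM processing_job
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, attempt_count
"""


def claim_next_job(cursor, pod_id):
    """Claim the oldest pending job, or return None if none is claimable.

    `cursor` is any DB-API cursor connected to PostgreSQL. The caller is
    expected to commit the surrounding transaction.
    """
    cursor.execute(CLAIM_SQL, {"pod_id": pod_id})
    row = cursor.fetchone()
    if row is None:
        return None
    job_id, attempt_count = row
    # The returned attempt_count is the fencing token for this attempt.
    return {"job_id": job_id, "attempt_count": attempt_count}
```

Because SKIP LOCKED passes over rows that another transaction has already locked, two pods running this statement at the same moment should receive different jobs, or one should receive None, rather than blocking on each other.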
Claimed jobs run in the app's bounded in-process executor. The current worker
limit is one job per pod (EXECUTOR_MAX_WORKERS = 1) to avoid thread-safety
issues in the Generator's runtime dependencies. More pods can still increase
overall throughput because each pod can claim a different pending row.
The task manager also runs maintenance:
- PostgreSQL advisory lock 420021 elects a single pod to run queue maintenance at a time.
- Running jobs with no recent progress are requeued until they reach MAX_JOB_ATTEMPTS.
- Jobs that exceed the attempt limit are marked failed instead of being retried forever.
- Very old pending or running jobs are failed after 24 hours.
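The maintenance rules above can be read as a per-row decision. The sketch below is one possible interpretation, not the project's implementation. The function maintenance_action, the values of MAX_JOB_ATTEMPTS and STALE_AFTER, measuring job age from a created_at timestamp, and treating "reached the limit" as attempt_count >= MAX_JOB_ATTEMPTS are all assumptions. Only the 24-hour limit is taken from this ADR.

```python
from datetime import datetime, timedelta, timezone

MAX_JOB_ATTEMPTS = 3                 # assumed value, not stated in this ADR
STALE_AFTER = timedelta(minutes=30)  # assumed "no recent progress" threshold
MAX_JOB_AGE = timedelta(hours=24)    # from this ADR


def maintenance_action(status, created_at, last_progress_at, attempt_count, now):
    """Return 'fail', 'requeue', or 'keep' for a single job row."""
    if status not in ("pending", "running"):
        return "keep"  # finished jobs are left alone
    if now - created_at > MAX_JOB_AGE:
        return "fail"  # very old pending or running job
    if status == "running" and now - last_progress_at > STALE_AFTER:
        if attempt_count >= MAX_JOB_ATTEMPTS:
            return "fail"  # retry budget exhausted
        return "requeue"   # back to pending for another attempt
    return "keep"


now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
stalled = maintenance_action(
    "running", now - timedelta(hours=1), now - timedelta(minutes=45), 1, now
)
```

In PostgreSQL, a pod could guard a pass like this with an advisory lock call such as SELECT pg_try_advisory_lock(420021), skipping the pass when the call returns false. Whether the task manager uses the session-level or transaction-level variant is not stated in this ADR.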
Generation code updates last_progress_at at pipeline checkpoints. When a
worker finishes, it writes the final job status only if its attempt_count
still matches the database row and the job has not been manually stopped. This
prevents a stale attempt from overwriting a newer attempt's status or run ID.
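The fenced final write can be illustrated with a conditional UPDATE. The sketch below uses an in-memory SQLite database so that it runs without PostgreSQL. The reduced table schema, the finish_job helper, and the "completed" status value are assumptions for illustration, and it further assumes that a manual stop moves the row out of the running status.

```python
import sqlite3

FENCED_FINISH_SQL = """
UPDATE processing_job
SET status = :final_status
WHERE id = :job_id
  AND attempt_count = :attempt_count
  AND status = 'running'
"""


def finish_job(conn, job_id, attempt_count, final_status):
    """Write the final status only if this attempt still owns the row."""
    cur = conn.execute(
        FENCED_FINISH_SQL,
        {
            "final_status": final_status,
            "job_id": job_id,
            "attempt_count": attempt_count,
        },
    )
    conn.commit()
    return cur.rowcount == 1  # False means the attempt was superseded or stopped


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processing_job "
    "(id INTEGER PRIMARY KEY, status TEXT, attempt_count INTEGER)"
)
# The job was requeued once and reclaimed, so attempt 2 is the live one.
conn.execute("INSERT INTO processing_job VALUES (1, 'running', 2)")

stale_won = finish_job(conn, 1, 1, "failed")        # stale attempt 1
current_won = finish_job(conn, 1, 2, "completed")   # current attempt 2
final_status = conn.execute(
    "SELECT status FROM processing_job WHERE id = 1"
).fetchone()[0]
```

Here the stale attempt's write matches no rows and is discarded, while the current attempt's write succeeds. A worker can apply the same comparison at progress checkpoints to notice that it has been superseded and stop early.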
Considered Alternatives
Synchronous request processing
Run the Generator inside the /ontology/submit request and return only when it
finishes.
This was rejected because ontology generation can run much longer than a normal web request. It would tie user experience and load balancer behaviour to pipeline duration, make retries ambiguous, and provide no durable status trail for the Jobs UI.
In-memory queue and background threads only
Use Python's in-memory queue or submit directly to the process executor from the request handler.
This is simple for one local process, but it is not durable. Jobs can disappear when a pod restarts, and each pod would only know about work submitted to that pod. It also does not provide safe cross-pod coordination, which is needed when the app scales horizontally.
External queue or worker system
Introduce a separate broker and worker system such as Amazon SQS, Celery, RQ, or another managed queue.
This would provide mature queue features, clearer worker separation, and better tools for retries and monitoring. It was not selected because it would add infrastructure, deployment, secrets, local-development setup, and another source of truth for status. At the current scale, the Workflow app already depends on PostgreSQL for job state, so a separate queue would add operational complexity without enough immediate benefit.
Single queue leader processes all jobs
Elect one pod as a queue leader and let only that pod claim and run jobs.
This reduces duplicate-processing risk, but it underuses extra pods and makes the leader a throughput bottleneck. The chosen design keeps advisory-lock leadership only for maintenance and lets every pod claim work safely with row locks.
PostgreSQL-backed job table queue
Use the existing processing_job table as both the user-visible job log and the
queue.
This was selected because it gives durable job submission, status polling, and
cross-pod coordination with infrastructure the app already requires. PostgreSQL
row locks with SKIP LOCKED let multiple pods claim different pending jobs
without central coordination. The trade-off is that queue behaviour now depends
on careful database transitions, polling, lease cleanup, and fencing of stale
attempts.
Consequences
The queue has at-least-once execution semantics. The system prevents stale attempts from writing final state, but a worker can still consume compute until it reaches a progress checkpoint and notices that its attempt has been superseded or stopped.
PostgreSQL is now part of the critical path for durable ontology processing. Without a working database, pending jobs cannot be reliably queued, claimed, or recovered. SQLite remains useful for local tests and development, but production queue semantics rely on PostgreSQL features such as row locks and advisory locks.
Throughput scales by adding pods, not by increasing threads inside a pod. That keeps the Generator safer while its runtime dependencies have shared global state, but it also means one long job can occupy a pod's worker slot for a long time.
The queue manager is intentionally small and app-local. This keeps deployment simple, but it means the team owns queue concerns such as polling interval, stale-job thresholds, retry limits, and operational observability.
Revisit This Decision When
- ontology generation volume requires many concurrent workers per pod;
- queue metrics, delayed retries, dead-letter handling, or prioritisation become operational requirements;
- the Generator is separated into an independently deployed worker service;
- PostgreSQL load from queue polling or locking becomes material;
- ingestion, harness, and ontology jobs need one unified queue with pipeline-specific dispatch and worker pools;
- the Generator dependencies become safe for higher in-process concurrency.
At that point, a managed queue or a dedicated worker framework should be reconsidered with the current production traffic, failure modes, and operating model in hand.
References
- govuk_ai_accelerator_app.py: ProcessingJob, /ontology/submit, app startup, and stop/status routes.
- scripts/pipeline/task_manager.py: polling, claiming, maintenance, and worker execution.
- scripts/pipeline/ontology_generator.py: progress updates, stop checks, attempt fencing, and final status writes.
- migrations/versions/c4e92f1b7a10_add_job_leases.py: lease and attempt-count columns.
- migrations/versions/9f3d7b6a1c21_add_last_progress_at.py: progress timestamp column.
- tests/test_task_manager.py and tests/test_job_fencing.py: expected queue, recovery, and stale-attempt behaviour.