govuk-ai-accelerator: Ontology Generation Cross-Repo Integration
This technical overview explains how the Workflow, Generator, Ontology Validator, Data Science Repo, and Content Workflow connect across the GOV.UK AI ontology repositories. It is intended for handover: a new team should be able to see which repository owns each part of the lifecycle, which runtime dependencies are involved, and which artifacts move between stages.
Repositories
| Name used here | Repository | Responsibility |
|---|---|---|
| Workflow | alphagov/govuk-ai-accelerator | Web app, ingestion workflow, ontology job orchestration, job tracking, artifact browsing, and ontology harness baseline comparison. |
| Generator | alphagov/govuk-ai-accelerator-tw-accelerator | Generator library. It reads domain inputs/configuration, runs the ontology pipeline, and writes schema, graph, and OWL/RDF artifacts. |
| Ontology Validator | alphagov/govuk-ai-accelerator-generator-e2e-testing-framework | Rule-based validator for generated Turtle (.ttl) ontology files, including naming/spelling checks and optional golden-schema comparison. |
| Data Science Repo | alphagov/govuk-ai-accelerator-tooling | Research and analysis notebooks, ground-truth ontology files, term extraction experiments, Bedrock exploration, and matching utilities. |
| Content Workflow | alphagov/govuk-ai-graph-tools | Downstream proof-of-concept graph and content-quality tooling that consumes ontology/knowledge-graph output for duplicate, outlier, and graph exploration workflows. |
Lifecycle
```mermaid
flowchart TD
subgraph workflowRepo["Workflow"]
urls["Source URLs"]
ingest["Ingest content<br/>/ontology/ingest"]
cleaned["Cleaned content<br/>S3 or local"]
submit["Submit job<br/>/ontology/submit"]
baseline["Accepted baseline<br/>accepted.json"]
harness["Compare baseline"]
report["Harness report<br/>regression_report.json"]
end
subgraph generatorRepo["Generator"]
generatorPackage["Installed package<br/>taxonomy-ontology-accelerator"]
generator["Build ontology<br/>OntologyPipelineBuilder"]
artifacts["Run artifacts<br/>schema.json<br/>graph.json<br/>ontology.ttl<br/>metrics CSV"]
end
subgraph dataScienceRepo["Data Science Repo"]
dataScience["Research + ground truth"]
end
subgraph validatorRepo["Ontology Validator"]
validator["Validate TTL"]
end
subgraph contentRepo["Content Workflow"]
contentWorkflow["Explore graph output"]
end
urls -->|"provide source list"| ingest
ingest -->|"extract and clean"| cleaned
cleaned -->|"stored as input"| submit
submit -->|"imports installed library"| generatorPackage
generatorPackage -->|"runs pipeline builder"| generator
generator -->|"writes output files"| artifacts
artifacts -->|"compare TTL and metrics"| harness
baseline -->|"provides accepted run"| harness
harness -->|"writes report"| report
artifacts -->|"process ontology.ttl"| validator
artifacts -->|"process graph.json"| contentWorkflow
dataScience -.->|"inform prompts and baselines"| generator
dataScience -.->|"inform validation rules"| validator
```
The Data Science Repo informs prompts, baselines, ground-truth checks, and validation expectations, but it is not part of the production run path.
Runtime And Support Paths
The production runtime path is the Workflow plus the Generator. The Workflow
ingests content, accepts ontology jobs, stores job state, and runs the Generator
as an installed Python package (taxonomy-ontology-accelerator) to produce
ontology artifacts. The Generator is a library dependency of the Workflow
runtime, not a separate service endpoint.
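Because the Generator is an installed dependency rather than a service, a quick handover check is whether the distribution is present in the Workflow environment and at which version. The sketch below uses only the standard library; the helper name `installed_version` is illustrative and is not part of either repository.

```python
from importlib import metadata
from typing import Optional

# Distribution name as given in this overview.
GENERATOR_DISTRIBUTION = "taxonomy-ontology-accelerator"


def installed_version(distribution: str = GENERATOR_DISTRIBUTION) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print(f"{GENERATOR_DISTRIBUTION} is not installed in this environment")
    else:
        print(f"{GENERATOR_DISTRIBUTION} {version}")
```

Recording this version alongside each run makes it easier to tie a change in artifacts back to a Generator update.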
The supporting repositories sit around that path:
- The Ontology Validator checks generated ontology.ttl files against agreed validation rules.
- The Data Science Repo contains the notebooks, ground-truth files, and experiments that informed prompts, baselines, and validation expectations.
- The Content Workflow consumes generated graph output for downstream graph, duplicate, and outlier exploration.
Integration Contracts
These files are the main contracts between repositories. Treat their names, formats, and meanings as cross-repo dependencies.
| Artifact | Produced by | Consumed by | Contract |
|---|---|---|---|
| ontology.ttl | Generator via the Workflow | Workflow harness, Ontology Validator, reviewers | OWL/RDF Turtle export for the generated ontology. |
| schema.json | Generator via the Workflow | Workflow UI, reviewers, downstream consumers | Entity and relationship type definitions. |
| graph.json | Generator via the Workflow | Workflow UI, Content Workflow, reviewers | Generated ontology/knowledge graph structure. |
| owl_ontology_metrics.csv | Generator and Workflow harness | Workflow historical jobs view, deployment review | Run metrics, with harness result columns when the harness has run. |
| regression_report.json | Workflow harness | Deployment/review process | Baseline comparison report for the candidate run. |
| baselines/accepted.json | Maintained baseline manifest | Harness | Pointer to the immutable accepted baseline run. |
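A lightweight way to guard these contracts is to check that a run folder contains the expected files before handing it to a downstream consumer. The following is a minimal sketch: the file names come from the table above, but the flat run-folder layout and the choice of which files count as required are assumptions, and `missing_artifacts` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

# Names from the contract table. Treating these three as "required" is an assumption.
REQUIRED_ARTIFACTS = ("ontology.ttl", "schema.json", "graph.json")


def missing_artifacts(run_dir: Path) -> list:
    """List required artifact files that are absent from a run folder."""
    return [name for name in REQUIRED_ARTIFACTS if not (run_dir / name).is_file()]


if __name__ == "__main__":
    # Demonstrate against a throwaway folder that only holds ontology.ttl.
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp)
        (run_dir / "ontology.ttl").write_text("", encoding="utf-8")
        print(missing_artifacts(run_dir))
```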
Version Coupling
The Workflow orchestrates the Generator, but Generator changes can alter the artifact shape, ontology terms, metrics, and validation results. A Generator update can therefore affect the Workflow UI, harness baseline, Ontology Validator expectations, and Content Workflow consumers.
When promoting or deploying Generator changes, check whether the accepted baseline, Ontology Validator fixtures, and Content Workflow assumptions still match the new outputs.
Stage Responsibilities
1. Workflow Ingestion
Owned by the Workflow.
The ingestion workflow starts from a list of GOV.UK URLs and produces cleaned
content files for a domain. The Workflow exposes POST /ontology/ingest; the
underlying scripts can also be run locally. Ingestion supports local or S3
storage through fsspec and writes timestamped logs for auditing.
Primary artifacts:
- raw downloaded HTML;
- extracted and cleaned markdown/text content;
- ingestion logs;
- domain input folders suitable for the Generator.
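To make the ingestion outputs concrete, the sketch below shows one way a URL could map to a cleaned content file name and how a timestamped log name could be built. Both naming conventions are assumptions made for illustration; the actual scheme is defined by the Workflow ingestion scripts.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def content_filename(url: str) -> str:
    """Turn a URL path into a flat markdown file name (hypothetical convention)."""
    path = urlparse(url).path.strip("/")
    slug = path.replace("/", "__") or "index"
    return f"{slug}.md"


def log_filename(started_at: datetime) -> str:
    """Build a UTC-timestamped log name so each run leaves its own audit file."""
    stamp = started_at.astimezone(timezone.utc)
    return f"ingestion_{stamp:%Y%m%dT%H%M%SZ}.log"


if __name__ == "__main__":
    print(content_filename("https://www.gov.uk/guidance/example-page"))
    print(log_filename(datetime.now(timezone.utc)))
```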
2. Generator Execution
Orchestrated by the Workflow, implemented by the Generator.
The Workflow accepts ontology jobs through POST /ontology/submit, tracks job
state in PostgreSQL, and runs the Generator package in a background task. The
Generator uses OntologyPipelineBuilder to set up the pipeline, extract
ontology data, deduplicate, build relationships, update the schema, validate,
save, and export the ontology.
Primary artifacts:
- schema.json: entity and relationship type definitions;
- graph.json: generated ontology graph;
- ontology.ttl: OWL/RDF Turtle export used by the Ontology Validator and harness checks;
- config.yaml: persisted run configuration;
- logs and run metadata;
- owl_ontology_metrics.csv where enabled by the Generator workflow.
3. Workflow Harness Comparison
Owned by the Workflow.
The ontology harness is a post-deployment baseline check. When enabled, it runs
the normal Generator against a dedicated harness domain, reads an accepted
baseline manifest, compares baseline and candidate ontology.ttl metrics, and
writes a regression report.
Harness configuration:
- ONTOLOGY_HARNESS_ENABLED: turns the scheduled harness on. If it is unset or false-like, the Workflow starts without queueing a harness job.
- ONTOLOGY_HARNESS_DEPLOYMENT_ID: identifies the deployment or Generator revision being checked. This is required when the harness is enabled because it becomes part of the one-job-per-deployment key.
- ONTOLOGY_HARNESS_DOMAIN: optional domain/folder name for the harness input and output. Defaults to ontology-harness-baseline.
- ONTOLOGY_HARNESS_CONFIG_URI: optional config file location. Defaults to s3://<bucket>/<domain>/config.yaml.
- ONTOLOGY_HARNESS_BASELINE_MANIFEST_URI: optional accepted-baseline manifest location. Defaults to s3://<bucket>/<domain>/baselines/accepted.json.
- baselines/accepted.json: the manifest the harness reads to find the immutable baseline run to compare against.
Only ONTOLOGY_HARNESS_ENABLED=true and ONTOLOGY_HARNESS_DEPLOYMENT_ID are
needed to schedule the harness with the default S3/domain layout. The other
settings are overrides for non-default locations. If the post-deployment harness
is not being run, none of these variables are needed.
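The defaulting rules above can be sketched as a small settings loader. This is an illustration under stated assumptions, not the Workflow's own code: the set of values treated as true, the `bucket` parameter, and the names `HarnessSettings` and `load_harness_settings` are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

TRUE_VALUES = {"1", "true", "yes", "on"}  # assumption: the real parser may accept other values
DEFAULT_DOMAIN = "ontology-harness-baseline"


@dataclass(frozen=True)
class HarnessSettings:
    deployment_id: str
    domain: str
    config_uri: str
    baseline_manifest_uri: str


def load_harness_settings(env: Mapping[str, str], bucket: str) -> Optional[HarnessSettings]:
    """Resolve harness settings, or return None when the harness is not enabled."""
    if env.get("ONTOLOGY_HARNESS_ENABLED", "").strip().lower() not in TRUE_VALUES:
        return None
    deployment_id = env.get("ONTOLOGY_HARNESS_DEPLOYMENT_ID", "").strip()
    if not deployment_id:
        raise ValueError("ONTOLOGY_HARNESS_DEPLOYMENT_ID is required when the harness is enabled")
    domain = env.get("ONTOLOGY_HARNESS_DOMAIN") or DEFAULT_DOMAIN
    root = f"s3://{bucket}/{domain}"
    return HarnessSettings(
        deployment_id=deployment_id,
        domain=domain,
        # Overrides win; otherwise fall back to the default S3/domain layout.
        config_uri=env.get("ONTOLOGY_HARNESS_CONFIG_URI") or f"{root}/config.yaml",
        baseline_manifest_uri=env.get("ONTOLOGY_HARNESS_BASELINE_MANIFEST_URI")
        or f"{root}/baselines/accepted.json",
    )


if __name__ == "__main__":
    # "example-bucket" is a placeholder, not a real bucket name.
    print(load_harness_settings(os.environ, bucket="example-bucket"))
```

Passing the environment in as a mapping keeps the resolution logic easy to test without touching real environment variables.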
Primary artifacts:
- candidate run output folder;
- regression_report.json;
- harness summary columns added to owl_ontology_metrics.csv.
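The comparison step can be pictured as a tolerance check between two sets of metrics. The sketch below is illustrative only: the metric names, the 10% relative tolerance, and the report fields are assumptions, and the real structure of regression_report.json is defined by the Workflow harness.

```python
import json


def compare_metrics(baseline: dict, candidate: dict, tolerance: float = 0.10) -> dict:
    """Flag metrics whose relative change from the baseline exceeds the tolerance."""
    regressions = {}
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        if cand_value is None:
            # A metric that disappeared from the candidate is always reported.
            regressions[name] = {"baseline": base_value, "candidate": None}
            continue
        # Guard the denominator so a zero baseline does not divide by zero.
        change = abs(cand_value - base_value) / max(abs(base_value), 1)
        if change > tolerance:
            regressions[name] = {"baseline": base_value, "candidate": cand_value}
    return {"passed": not regressions, "regressions": regressions}


if __name__ == "__main__":
    # Hypothetical metric names and values, for illustration only.
    report = compare_metrics(
        {"class_count": 40, "property_count": 20},
        {"class_count": 41, "property_count": 10},
    )
    print(json.dumps(report, indent=2))
```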
4. Ontology Validation
Owned by the Ontology Validator.
The Ontology Validator validates generated ontology.ttl files. This is
validation/testing rather than LLM scoring. It checks naming conventions,
US-English spelling conventions, and optional golden-schema comparison.
Primary inputs:
- generated ontology.ttl;
- optional golden/reference .ttl;
- optional .allowlist for intentional domain terms.
Primary artifacts:
- command-line pass/fail output;
- JSON output when requested;
- numbered violation report folders containing a copy of the checked .ttl and violations.txt.
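As an illustration of what a rule-based naming check with an allowlist can look like, here is a minimal standard-library sketch. The specific rules (PascalCase classes, camelCase properties) and all function names are assumptions; the Ontology Validator's actual rule modules work on parsed Turtle and may apply different conventions.

```python
import re

CLASS_NAME = re.compile(r"^[A-Z][A-Za-z0-9]*$")     # assumed rule: PascalCase classes
PROPERTY_NAME = re.compile(r"^[a-z][A-Za-z0-9]*$")  # assumed rule: camelCase properties


def local_name(iri: str) -> str:
    """Return the part of an IRI after the last '#' or '/'."""
    return re.split(r"[#/]", iri)[-1]


def naming_violations(class_iris, property_iris, allowlist=frozenset()):
    """Collect messages for names that break the assumed rules and are not allowlisted."""
    violations = []
    checks = (("class", class_iris, CLASS_NAME), ("property", property_iris, PROPERTY_NAME))
    for kind, iris, pattern in checks:
        for iri in iris:
            name = local_name(iri)
            if name in allowlist or pattern.match(name):
                continue
            violations.append(f"{kind} name '{name}' breaks the naming rule")
    return violations


if __name__ == "__main__":
    # example.org IRIs are placeholders, not terms from a generated ontology.
    print(naming_violations(
        ["http://example.org/onto#TaxCredit", "http://example.org/onto#tax_relief"],
        ["http://example.org/onto#appliesTo"],
    ))
```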
5. Data Science Repo
Owned by the Data Science Repo.
This repository contains experimental notebooks and helper utilities used during the ontology work. It is not the production runtime path, but it helps explain how ground-truth ontologies, extraction experiments, Bedrock exploration, and matching approaches informed Generator and Ontology Validator expectations.
Primary artifacts:
- ground-truth .ttl and .rdf files;
- notebooks for term extraction and enrichment experiments;
- matching utilities for direct, fuzzy, and semantic matching;
- Bedrock model exploration outputs.
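The difference between direct and fuzzy matching against ground-truth terms can be sketched with the standard library. This example substitutes `difflib.SequenceMatcher` for the fuzzy step, so it is not the repository's implementation; the 0.85 threshold and the function names are assumptions, and semantic (embedding-based) matching is left out.

```python
from difflib import SequenceMatcher


def normalise(term: str) -> str:
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def match_term(term, ground_truth, threshold=0.85):
    """Return (matched_term, method), trying an exact match before a fuzzy one."""
    wanted = normalise(term)
    by_norm = {normalise(t): t for t in ground_truth}
    if wanted in by_norm:
        return by_norm[wanted], "direct"
    best, best_score = None, 0.0
    for norm, original in by_norm.items():
        score = SequenceMatcher(None, wanted, norm).ratio()
        if score > best_score:
            best, best_score = original, score
    if best is not None and best_score >= threshold:
        return best, "fuzzy"
    return None, "none"


if __name__ == "__main__":
    truth = ["Child Benefit", "Income Tax"]
    for candidate in ("child  benefit", "Child Benefits", "Passport"):
        print(candidate, "->", match_term(candidate, truth))
```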
6. Content Workflow
Owned by the Content Workflow.
The Content Workflow consumes ontology/knowledge-graph output downstream. It turns a knowledge graph into browsable graph views and content-quality signals, including semantic duplicate and outlier workflows.
Primary inputs:
- generated knowledge graph JSON;
- S3 source markdown/documents;
- optional OpenSearch index for retrieval.
Primary artifacts:
- graph view model output such as graphNode.json;
- visual graph views;
- duplicate and outlier analysis outputs.
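To show how graph output could feed a browsable view, the sketch below converts a node/edge structure into the flat element list that Cytoscape.js accepts. The input shape (`nodes` with `id`/`label`, `edges` with `source`/`target`/`type`) is an assumption about the knowledge graph JSON, not its documented schema, and `to_cytoscape_elements` is a hypothetical helper.

```python
import json


def to_cytoscape_elements(graph: dict) -> list:
    """Convert an assumed nodes/edges graph into Cytoscape.js element dicts."""
    elements = []
    for node in graph.get("nodes", []):
        elements.append({"data": {"id": node["id"], "label": node.get("label", node["id"])}})
    for index, edge in enumerate(graph.get("edges", [])):
        elements.append({
            "data": {
                "id": f"e{index}",  # synthetic edge id; edges need ids distinct from nodes
                "source": edge["source"],
                "target": edge["target"],
                "label": edge.get("type", ""),
            }
        })
    return elements


if __name__ == "__main__":
    # Hypothetical sample graph for illustration.
    sample = {
        "nodes": [{"id": "n1", "label": "Benefit"}, {"id": "n2"}],
        "edges": [{"source": "n1", "target": "n2", "type": "relatedTo"}],
    }
    print(json.dumps(to_cytoscape_elements(sample), indent=2))
```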
Runtime Dependencies
| Area | Main dependencies |
|---|---|
| Workflow | Python 3.13, uv, Flask, Waitress, PostgreSQL, SQLAlchemy, AWS credentials, S3, fsspec. |
| Generator | taxonomy-ontology-accelerator, LLM provider configuration, AWS Bedrock where configured, S3 or local filesystem storage. |
| Workflow harness | Same Generator dependencies plus baseline manifest and S3 access to baseline/candidate ontology.ttl files. |
| Ontology Validator | Python, uv, rdflib, OWL-RL inference, rule modules, optional golden .ttl. |
| Data Science Repo | Python notebooks, Bedrock access for experiments, ontology files, matching libraries such as sentence-transformers/rapidfuzz where used. |
| Content Workflow | Python 3.12, Flask/Uvicorn, AWS Bedrock, Amazon OpenSearch, S3, Cytoscape.js, Pydantic, uv. |
Artifact Flow
| Stage | Input | Output | Next consumer |
|---|---|---|---|
| Workflow ingestion | GOV.UK URLs or source list | Cleaned markdown/text, logs | Generator execution |
| Generator execution | Domain config, prompt, cleaned input content | schema.json, graph.json, ontology.ttl, metrics/logs | Workflow harness, Ontology Validator, Content Workflow, Workflow UI |
| Harness comparison | Candidate ontology.ttl, accepted baseline manifest | regression_report.json, harness CSV columns | Deployment/review process |
| Ontology validation | Generated ontology.ttl, optional golden .ttl | Pass/fail result, optional JSON, violation reports | Generator maintainers and handover reviewers |
| Content Workflow | Knowledge graph JSON and source documents | Graph visualisations, duplicate/outlier reports | Content quality review |
| Data Science Repo | Ground truth, generated runs, experiment data | Matching results, notebooks, prompts/analysis | Generator and Ontology Validator design decisions |
Change Impact Guide
| Change | Check these repositories |
|---|---|
| Change ontology.ttl structure or naming conventions | Workflow, Ontology Validator, Content Workflow. |
| Change schema.json or graph.json shape | Workflow, Content Workflow, any Data Science Repo notebooks that read generated output. |
| Change Generator prompts, models, or extraction logic | Harness baseline, Ontology Validator fixtures, ground-truth assumptions in the Data Science Repo. |
| Change harness metrics or report fields | Workflow historical jobs view, deployment review process, owl_ontology_metrics.csv consumers. |
| Promote a new accepted baseline | baselines/accepted.json, harness run history, any handover notes explaining why the baseline changed. |
Where To Start When Debugging
| Symptom | Start here |
|---|---|
| Source URLs did not produce usable files | Workflow ingestion route and scripts/ingestion/README.md. |
| Ontology job failed or stopped | Workflow job status, logs, and scripts/pipeline/ontology_generator.py. |
| Generated output looks structurally wrong | Generator ontology engine docs and run book. |
| Harness failed after deployment | Workflow harness docs and scripts/pipeline/ontology_harness.py. |
| ontology.ttl violates naming/spelling/golden expectations | Ontology Validator run book and validator reports. |
| Graph/outlier UI output is missing or surprising | Content Workflow README and generated graph artifacts. |
| You need historical experiment context | Data Science Repo notebooks, ground truth data, and matching utilities. |