Last updated: 24 Jun 2026

govuk-ai-accelerator: Ontology Generation Cross-Repo Integration

This technical overview explains how the Workflow, Generator, Ontology Validator, Data Science Repo, and Content Workflow connect across the GOV.UK AI ontology repositories. It is intended for handover: a new team should be able to see which repository owns each part of the lifecycle, which runtime dependencies are involved, and which artifacts move between stages.

Repositories

| Name used here | Repository | Responsibility |
| --- | --- | --- |
| Workflow | alphagov/govuk-ai-accelerator | Web app, ingestion workflow, ontology job orchestration, job tracking, artifact browsing, and ontology harness baseline comparison. |
| Generator | alphagov/govuk-ai-accelerator-tw-accelerator | Generator library. It reads domain inputs/configuration, runs the ontology pipeline, and writes schema, graph, and OWL/RDF artifacts. |
| Ontology Validator | alphagov/govuk-ai-accelerator-generator-e2e-testing-framework | Rule-based validator for generated Turtle (.ttl) ontology files, including naming/spelling checks and optional golden-schema comparison. |
| Data Science Repo | alphagov/govuk-ai-accelerator-tooling | Research and analysis notebooks, ground-truth ontology files, term extraction experiments, Bedrock exploration, and matching utilities. |
| Content Workflow | alphagov/govuk-ai-graph-tools | Downstream proof-of-concept graph and content-quality tooling that consumes ontology/knowledge-graph output for duplicate, outlier, and graph exploration workflows. |

Lifecycle

flowchart TD
    subgraph workflowRepo["Workflow"]
        urls["Source URLs"]
        ingest["Ingest content<br/>/ontology/ingest"]
        cleaned["Cleaned content<br/>S3 or local"]
        submit["Submit job<br/>/ontology/submit"]
        baseline["Accepted baseline<br/>accepted.json"]
        harness["Compare baseline"]
        report["Harness report<br/>regression_report.json"]
    end

    subgraph generatorRepo["Generator"]
        generatorPackage["Installed package<br/>taxonomy-ontology-accelerator"]
        generator["Build ontology<br/>OntologyPipelineBuilder"]
        artifacts["Run artifacts<br/>schema.json<br/>graph.json<br/>ontology.ttl<br/>metrics CSV"]
    end

    subgraph dataScienceRepo["Data Science Repo"]
        dataScience["Research + ground truth"]
    end

    subgraph validatorRepo["Ontology Validator"]
        validator["Validate TTL"]
    end

    subgraph contentRepo["Content Workflow"]
        contentWorkflow["Explore graph output"]
    end

    urls -->|"provide source list"| ingest
    ingest -->|"extract and clean"| cleaned
    cleaned -->|"stored as input"| submit
    submit -->|"imports installed library"| generatorPackage
    generatorPackage -->|"runs pipeline builder"| generator
    generator -->|"writes output files"| artifacts
    artifacts -->|"compare TTL and metrics"| harness
    baseline -->|"provides accepted run"| harness
    harness -->|"writes report"| report
    artifacts -->|"process ontology.ttl"| validator
    artifacts -->|"process graph.json"| contentWorkflow
    dataScience -.->|"inform prompts and baselines"| generator
    dataScience -.->|"inform validation rules"| validator

The Data Science Repo informs prompts, baselines, ground-truth checks, and validation expectations, but it is not part of the production run path.

Runtime And Support Paths

The production runtime path is the Workflow plus the Generator. The Workflow ingests content, accepts ontology jobs, stores job state, and runs the Generator as an installed Python package (taxonomy-ontology-accelerator) to produce ontology artifacts. The Generator is a library dependency of the Workflow runtime, not a separate service endpoint.

The supporting repositories sit around that path:

  • The Ontology Validator checks generated ontology.ttl files against agreed validation rules.
  • The Data Science Repo contains the notebooks, ground-truth files, and experiments that informed prompts, baselines, and validation expectations.
  • The Content Workflow consumes generated graph output for downstream graph, duplicate, and outlier exploration.

Integration Contracts

These files are the main contracts between repositories. Treat their names, formats, and meanings as cross-repo dependencies.

| Artifact | Produced by | Consumed by | Contract |
| --- | --- | --- | --- |
| ontology.ttl | Generator via the Workflow | Workflow harness, Ontology Validator, reviewers | OWL/RDF Turtle export for the generated ontology. |
| schema.json | Generator via the Workflow | Workflow UI, reviewers, downstream consumers | Entity and relationship type definitions. |
| graph.json | Generator via the Workflow | Workflow UI, Content Workflow, reviewers | Generated ontology/knowledge graph structure. |
| owl_ontology_metrics.csv | Generator and Workflow harness | Workflow historical jobs view, deployment review | Run metrics, with harness result columns when the harness has run. |
| regression_report.json | Workflow harness | Deployment/review process | Baseline comparison report for the candidate run. |
| baselines/accepted.json | Maintained baseline manifest | Harness | Pointer to the immutable accepted baseline run. |
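
Because these file names act as contracts, a quick presence check on a run folder can catch a broken hand-off before a consumer does. The following is a minimal sketch, assuming a run is available as a local directory; the function name `check_run_artifacts` and the split into required and optional artifacts are assumptions for illustration, not part of any repository.

```python
from pathlib import Path

# Artifact names come from the contracts table. Which of them are mandatory
# for every run is an assumption made for this sketch.
REQUIRED_ARTIFACTS = ("ontology.ttl", "schema.json", "graph.json")
OPTIONAL_ARTIFACTS = ("owl_ontology_metrics.csv", "regression_report.json")


def check_run_artifacts(run_dir: Path) -> dict[str, list[str]]:
    """Report which contract artifacts are present, missing, or empty."""
    result: dict[str, list[str]] = {"present": [], "missing": [], "empty": []}
    for name in REQUIRED_ARTIFACTS + OPTIONAL_ARTIFACTS:
        path = run_dir / name
        if not path.is_file():
            # An absent optional artifact is not treated as a problem.
            if name in REQUIRED_ARTIFACTS:
                result["missing"].append(name)
        elif path.stat().st_size == 0:
            result["empty"].append(name)
        else:
            result["present"].append(name)
    return result
```

A caller could treat any non-empty `missing` or `empty` list as a reason to stop before running the harness or the Ontology Validator.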

Version Coupling

The Workflow orchestrates the Generator, but Generator changes can alter the artifact shape, ontology terms, metrics, and validation results. A Generator update can therefore affect the Workflow UI, harness baseline, Ontology Validator expectations, and Content Workflow consumers.

When promoting or deploying Generator changes, check whether the accepted baseline, Ontology Validator fixtures, and Content Workflow assumptions still match the new outputs.

Stage Responsibilities

1. Workflow Ingestion

Owned by the Workflow.

The ingestion workflow starts from a list of GOV.UK URLs and produces cleaned content files for a domain. The Workflow exposes POST /ontology/ingest; the underlying scripts can also be run locally. Ingestion supports local or S3 storage through fsspec and writes timestamped logs for auditing.
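
As an illustration of calling the ingestion route, the sketch below builds (but does not send) a POST request using only the standard library. The route path comes from this document; the JSON field names `domain` and `urls`, the helper name `build_ingest_request`, and the base URL are hypothetical, so check the Workflow route handler for the real payload shape.

```python
import json
from urllib.request import Request


def build_ingest_request(base_url: str, domain: str, urls: list[str]) -> Request:
    """Build (but do not send) a POST request for /ontology/ingest.

    The JSON field names "domain" and "urls" are hypothetical placeholders.
    """
    body = json.dumps({"domain": domain, "urls": urls}).encode("utf-8")
    return Request(
        url=base_url.rstrip("/") + "/ontology/ingest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request can then be sent with `urllib.request.urlopen(request)` against a running Workflow instance.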

Primary artifacts:

  • raw downloaded HTML;
  • extracted and cleaned markdown/text content;
  • ingestion logs;
  • domain input folders suitable for the Generator.

2. Generator Execution

Orchestrated by the Workflow, implemented by the Generator.

The Workflow accepts ontology jobs through POST /ontology/submit, tracks job state in PostgreSQL, and runs the Generator package in a background task. The Generator uses OntologyPipelineBuilder to set up the pipeline, extract ontology data, deduplicate, build relationships, update the schema, validate, save, and export the ontology.
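
The submit-then-track pattern described above can be sketched as follows. This is not the Workflow's implementation: an in-memory dictionary stands in for the PostgreSQL job table, the status names are assumptions, and `JobTracker` and `run_generator` are hypothetical names. The callable passed in represents whatever invokes the Generator package.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable
from uuid import uuid4


class JobTracker:
    """In-memory stand-in for job-state tracking (hypothetical sketch)."""

    def __init__(self) -> None:
        self._executor = ThreadPoolExecutor(max_workers=1)
        self.status: dict[str, str] = {}
        self.futures: dict[str, Future] = {}

    def submit(self, run_generator: Callable[[], None]) -> str:
        """Record the job as queued, then hand it to a background worker."""
        job_id = uuid4().hex
        self.status[job_id] = "queued"
        self.futures[job_id] = self._executor.submit(self._run, job_id, run_generator)
        return job_id

    def _run(self, job_id: str, run_generator: Callable[[], None]) -> None:
        self.status[job_id] = "running"
        try:
            run_generator()
        except Exception:
            # A failed run is recorded rather than propagated to the caller.
            self.status[job_id] = "failed"
        else:
            self.status[job_id] = "succeeded"
```

Returning the job ID immediately lets the HTTP request finish while the Generator run continues, which is why job state has to be stored somewhere the status view can read it.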

Primary artifacts:

  • schema.json: entity and relationship type definitions;
  • graph.json: generated ontology graph;
  • ontology.ttl: OWL/RDF Turtle export used by the Ontology Validator and harness checks;
  • config.yaml: persisted run configuration;
  • logs and run metadata;
  • owl_ontology_metrics.csv, where metrics output is enabled for the Generator run.

3. Workflow Harness Comparison

Owned by the Workflow.

The ontology harness is a post-deployment baseline check. When enabled, it runs the normal Generator against a dedicated harness domain, reads an accepted baseline manifest, compares baseline and candidate ontology.ttl metrics, and writes a regression report.

Harness configuration:

  • ONTOLOGY_HARNESS_ENABLED: turns the scheduled harness on. If it is unset or false-like, the Workflow starts without queueing a harness job.
  • ONTOLOGY_HARNESS_DEPLOYMENT_ID: identifies the deployment or Generator revision being checked. This is required when the harness is enabled because it becomes part of the one-job-per-deployment key.
  • ONTOLOGY_HARNESS_DOMAIN: optional domain/folder name for the harness input and output. Defaults to ontology-harness-baseline.
  • ONTOLOGY_HARNESS_CONFIG_URI: optional config file location. Defaults to s3://<bucket>/<domain>/config.yaml.
  • ONTOLOGY_HARNESS_BASELINE_MANIFEST_URI: optional accepted-baseline manifest location. Defaults to s3://<bucket>/<domain>/baselines/accepted.json.
  • baselines/accepted.json: the manifest the harness reads to find the immutable baseline run to compare against.

Only ONTOLOGY_HARNESS_ENABLED=true and ONTOLOGY_HARNESS_DEPLOYMENT_ID are needed to schedule the harness with the default S3/domain layout. The other settings are overrides for non-default locations. If the post-deployment harness is not being run, none of these variables are needed.
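
The resolution rules above can be sketched as a small loader. The variable names and default locations come from this document; the `bucket` parameter, the set of strings treated as true, and the names `HarnessConfig` and `load_harness_config` are assumptions for illustration rather than the Workflow's actual code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_DOMAIN = "ontology-harness-baseline"
# Which strings count as "true" is an assumption for this sketch.
TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class HarnessConfig:
    deployment_id: str
    domain: str
    config_uri: str
    baseline_manifest_uri: str


def load_harness_config(
    bucket: str, env: Mapping[str, str] = os.environ
) -> Optional[HarnessConfig]:
    """Return resolved harness settings, or None when the harness is off."""
    if env.get("ONTOLOGY_HARNESS_ENABLED", "").strip().lower() not in TRUTHY:
        return None
    deployment_id = env.get("ONTOLOGY_HARNESS_DEPLOYMENT_ID", "").strip()
    if not deployment_id:
        raise ValueError(
            "ONTOLOGY_HARNESS_DEPLOYMENT_ID is required when the harness is enabled"
        )
    domain = env.get("ONTOLOGY_HARNESS_DOMAIN") or DEFAULT_DOMAIN
    root = f"s3://{bucket}/{domain}"
    return HarnessConfig(
        deployment_id=deployment_id,
        domain=domain,
        # Overrides win; otherwise fall back to the default S3/domain layout.
        config_uri=env.get("ONTOLOGY_HARNESS_CONFIG_URI") or f"{root}/config.yaml",
        baseline_manifest_uri=env.get("ONTOLOGY_HARNESS_BASELINE_MANIFEST_URI")
        or f"{root}/baselines/accepted.json",
    )
```

Passing the environment in as a mapping keeps the function easy to exercise without touching the real process environment.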

Primary artifacts:

  • candidate run output folder;
  • regression_report.json;
  • harness summary columns added to owl_ontology_metrics.csv.
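
To make the comparison step concrete, here is a minimal sketch of a baseline-versus-candidate check that writes regression_report.json. The metric names, the relative-tolerance rule, and the report fields are all hypothetical; the real harness defines its own metrics and report schema.

```python
import json
from pathlib import Path


def compare_metrics(
    baseline: dict[str, float], candidate: dict[str, float], tolerance: float = 0.10
) -> dict:
    """Flag metrics whose relative change from the baseline exceeds tolerance."""
    regressions = {}
    for name, base_value in baseline.items():
        cand_value = candidate.get(name)
        if cand_value is None:
            # A metric that disappeared from the candidate is always flagged.
            regressions[name] = {"baseline": base_value, "candidate": None}
            continue
        # Avoid division by zero when the baseline metric is 0.
        denominator = abs(base_value) or 1.0
        if abs(cand_value - base_value) / denominator > tolerance:
            regressions[name] = {"baseline": base_value, "candidate": cand_value}
    return {"passed": not regressions, "regressions": regressions}


def write_report(report: dict, out_dir: Path) -> Path:
    """Write the comparison result next to the candidate run output."""
    path = out_dir / "regression_report.json"
    path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return path
```

A relative tolerance is used here because LLM-driven generation is unlikely to reproduce identical counts between runs; an exact-match rule would flag every candidate.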

4. Ontology Validation

Owned by the Ontology Validator.

The Ontology Validator validates generated ontology.ttl files. This is rule-based validation and testing rather than LLM scoring. It checks naming conventions and US-English spelling conventions, and can optionally compare the output against a golden schema.

Primary inputs:

  • generated ontology.ttl;
  • optional golden/reference .ttl;
  • optional .allowlist for intentional domain terms.

Primary artifacts:

  • command-line pass/fail output;
  • JSON output when requested;
  • numbered violation report folders containing a copy of the checked .ttl and violations.txt.
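
The interaction between a naming rule and the allowlist can be sketched as below. The PascalCase rule is a stand-in chosen for illustration, not one of the validator's actual rules, and `find_naming_violations` is a hypothetical name. The sketch works on class local names that have already been extracted from the .ttl file, a step the real validator performs with rdflib.

```python
import re
from typing import Iterable

# Hypothetical rule: class local names should be PascalCase, e.g. "TaxReturn".
PASCAL_CASE = re.compile(r"[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)*")


def find_naming_violations(
    class_names: Iterable[str], allowlist: frozenset[str] = frozenset()
) -> list[str]:
    """Return names that break the rule and are not explicitly allowlisted."""
    return sorted(
        name
        for name in class_names
        if name not in allowlist and not PASCAL_CASE.fullmatch(name)
    )
```

For example, an acronym such as `HMRC` fails the PascalCase pattern but is accepted when it appears in the allowlist, which is the role the optional .allowlist file plays for intentional domain terms.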

5. Data Science Repo

Owned by the Data Science Repo.

This repository contains experimental notebooks and helper utilities used during the ontology work. It is not the production runtime path, but it helps explain how ground-truth ontologies, extraction experiments, Bedrock exploration, and matching approaches informed Generator and Ontology Validator expectations.

Primary artifacts:

  • ground-truth .ttl and .rdf files;
  • notebooks for term extraction and enrichment experiments;
  • matching utilities for direct, fuzzy, and semantic matching;
  • Bedrock model exploration outputs.

6. Content Workflow

Owned by the Content Workflow.

The Content Workflow consumes ontology/knowledge-graph output downstream. It turns a knowledge graph into browsable graph views and content-quality signals, including semantic duplicate and outlier workflows.

Primary inputs:

  • generated knowledge graph JSON;
  • S3 source markdown/documents;
  • optional OpenSearch index for retrieval.

Primary artifacts:

  • graph view model output such as graphNode.json;
  • visual graph views;
  • duplicate and outlier analysis outputs.

Runtime Dependencies

| Area | Main dependencies |
| --- | --- |
| Workflow | Python 3.13, uv, Flask, Waitress, PostgreSQL, SQLAlchemy, AWS credentials, S3, fsspec. |
| Generator | taxonomy-ontology-accelerator, LLM provider configuration, AWS Bedrock where configured, S3 or local filesystem storage. |
| Workflow harness | Same Generator dependencies plus baseline manifest and S3 access to baseline/candidate ontology.ttl files. |
| Ontology Validator | Python, uv, rdflib, OWL-RL inference, rule modules, optional golden .ttl. |
| Data Science Repo | Python notebooks, Bedrock access for experiments, ontology files, matching libraries such as sentence-transformers/rapidfuzz where used. |
| Content Workflow | Python 3.12, Flask/Uvicorn, AWS Bedrock, Amazon OpenSearch, S3, Cytoscape.js, Pydantic, uv. |

Artifact Flow

| Stage | Input | Output | Next consumer |
| --- | --- | --- | --- |
| Workflow ingestion | GOV.UK URLs or source list | Cleaned markdown/text, logs | Generator execution |
| Generator execution | Domain config, prompt, cleaned input content | schema.json, graph.json, ontology.ttl, metrics/logs | Workflow harness, Ontology Validator, Content Workflow, Workflow UI |
| Harness comparison | Candidate ontology.ttl, accepted baseline manifest | regression_report.json, harness CSV columns | Deployment/review process |
| Ontology validation | Generated ontology.ttl, optional golden .ttl | Pass/fail result, optional JSON, violation reports | Generator maintainers and handover reviewers |
| Content Workflow | Knowledge graph JSON and source documents | Graph visualisations, duplicate/outlier reports | Content quality review |
| Data Science Repo | Ground truth, generated runs, experiment data | Matching results, notebooks, prompts/analysis | Generator and Ontology Validator design decisions |

Change Impact Guide

| Change | Check these repositories |
| --- | --- |
| Change ontology.ttl structure or naming conventions | Workflow, Ontology Validator, Content Workflow. |
| Change schema.json or graph.json shape | Workflow, Content Workflow, any Data Science Repo notebooks that read generated output. |
| Change Generator prompts, models, or extraction logic | Harness baseline, Ontology Validator fixtures, ground-truth assumptions in the Data Science Repo. |
| Change harness metrics or report fields | Workflow historical jobs view, deployment review process, owl_ontology_metrics.csv consumers. |
| Promote a new accepted baseline | baselines/accepted.json, harness run history, any handover notes explaining why the baseline changed. |

Where To Start When Debugging

| Symptom | Start here |
| --- | --- |
| Source URLs did not produce usable files | Workflow ingestion route and scripts/ingestion/README.md. |
| Ontology job failed or stopped | Workflow job status, logs, and scripts/pipeline/ontology_generator.py. |
| Generated output looks structurally wrong | Generator ontology engine docs and run book. |
| Harness failed after deployment | Workflow harness docs and scripts/pipeline/ontology_harness.py. |
| ontology.ttl violates naming/spelling/golden expectations | Ontology Validator run book and validator reports. |
| Graph/outlier UI output is missing or surprising | Content Workflow README and generated graph artifacts. |
| You need historical experiment context | Data Science Repo notebooks, ground truth data, and matching utilities. |