Repository: govuk-document-clustering-experiment

Feasibility of suggesting new topics on GOV.UK using a sentence transformer model, document clustering and an LLM

GitHub: govuk-document-clustering-experiment
Ownership: #govuk-publishing-tagging-workflow
Category: Utilities

README

Feasibility of using machine classification to improve GOV.UK taxonomy

Setup

Set up a GOV.UK development environment.
Set up the Content Tagger application so it can be run from govuk-docker.
Replicate the production data for Content Tagger locally.
Install the version of Python specified in .python-version, e.g. by running mise install python (this requires idiomatic_version_file_enable_tools to be enabled for Python in mise).
Install pipenv by running pip install --user pipenv.
Install Python libraries by running pipenv install.
Sign up at Hugging Face and create a token of type “Read”.
Copy the example environment file by running cp .env.example .env.
Set the value of HF_TOKEN in .env to the Hugging Face token.
Set the value of OPENROUTER_API_KEY in .env to your OpenRouter API key.
Generate the suggested topics by running pipenv run ./suggest_topics <taxon-base-path>.
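
Before generating topics, it can help to confirm that .env contains both values set in the steps above. The following is a minimal sketch and not part of this repo: the helper names parse_env and missing_keys are hypothetical, and it assumes .env uses plain KEY=VALUE lines with optional quotes and # comments.

```python
from pathlib import Path

# The two variables the setup steps ask you to set in .env.
REQUIRED_KEYS = ("HF_TOKEN", "OPENROUTER_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments.

    Hypothetical helper: assumes a plain dotenv layout, no multi-line values.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional single/double quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or set to an empty value."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    missing = missing_keys(text)
    if missing:
        print("Missing or empty in .env: " + ", ".join(missing))
    else:
        print("All required keys are set.")
```

Saved under any file name in the repo root, the script can be run with pipenv run python <file> before running suggest_topics. It only checks that the values are present, not that the token or API key is valid.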

Documentation

This repo is configured to generate a GitHub Pages website, currently hosted at https://alphagov.github.io/govuk-document-clustering-experiment/. The site describes how we ran experiments using this code and the results we obtained.