Repository: govuk-document-clustering-experiment

Feasibility of suggesting new topics on GOV.UK using a sentence transformer model, document clustering & an LLM

Feasibility of using machine classification to improve GOV.UK taxonomy

Setup

  • Set up a GOV.UK development environment.
  • Set up the Content Tagger application so it can be run from govuk-docker.
  • Replicate the production data for Content Tagger locally.
  • Install the version of Python specified in .python-version, e.g. by running mise install python (with the mise setting idiomatic_version_file_enable_tools enabled for Python).
  • Install pipenv by running pip install --user pipenv.
  • Install Python libraries by running pipenv install.
  • Sign up at Hugging Face and create a token of type “Read”.
  • Copy the example environment file: cp .env.example .env.
  • Set the value of HF_TOKEN in .env to the Hugging Face token.
  • Set the value of OPENROUTER_API_KEY in .env to your OpenRouter API key.
  • Generate the suggested topics: pipenv run ./suggest_topics <taxon-base-path>
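
Before running the final step, it can help to confirm that both values in .env are filled in. The following is a minimal sketch of such a check and is not part of this repository: the helper names (parse_env, missing_keys, check_env_file) are hypothetical, and it assumes .env contains plain KEY=VALUE lines, with no "export" prefixes or multi-line values.

```python
from pathlib import Path

# The two variables named in the setup steps above.
REQUIRED_KEYS = ("HF_TOKEN", "OPENROUTER_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments.

    Assumption: no "export" prefixes and no multi-line values.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty in the given text."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


def check_env_file(path: str = ".env") -> list[str]:
    """Read the env file at `path` and return any required keys still unset.

    Raises FileNotFoundError if the file has not been created yet.
    """
    return missing_keys(Path(path).read_text())
```

Calling check_env_file() from the repository root returns an empty list when both values are set, and otherwise returns the names of the keys that still need a value.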

Documentation

This repo is configured to generate a GitHub Pages website, currently hosted at https://alphagov.github.io/govuk-document-clustering-experiment/. The site describes how we ran experiments using this code and the results we obtained.