Repository: govuk-chat-evaluation

Tools to evaluate GOV.UK Chat

GitHub: govuk-chat-evaluation
Ownership: #ai-govuk owns the repo. #dev-notifications-ai-govuk receives automated alerts for this repo.
Category: Utilities

README

A CLI tool to run evaluation tasks on GOV.UK Chat. Each evaluation task tests the performance of a specific component of the chat system, allowing for targeted diagnostics and improvement. A task typically takes a dataset of user input and expected output, generates actual output from GOV.UK Chat, and then compares the actual output with the expected output to produce results.

Nomenclature

Evaluation - the process of comparing the expected output and actual output to measure the quality of the actual output
Config - a mechanism to alter parameters and options for an individual evaluation
Dataset - a collection of data that contains input and expected output, and may also contain actual output
Generate - the process of taking user input and generating the actual output with GOV.UK Chat
Results - data that is produced from an individual evaluation to provide qualitative insights
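
To make the terms above concrete, here is a minimal sketch of how a dataset, the generate step, and an evaluation could fit together. It is an illustration only, not this repository's implementation: DatasetItem, generate, evaluate, fake_chat and the exact-match comparison are all hypothetical names and choices, and the real tasks may compare outputs very differently (for example with judge models).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class DatasetItem:
    """One dataset row (hypothetical shape): input, expected output, optional actual output."""

    question: str
    expected: str
    actual: str | None = None  # filled in by the generate step if missing


def generate(items: list[DatasetItem], ask: Callable[[str], str]) -> list[DatasetItem]:
    """Generate: produce actual output for every item that does not already have it."""
    return [
        item
        if item.actual is not None
        else DatasetItem(item.question, item.expected, ask(item.question))
        for item in items
    ]


def evaluate(items: list[DatasetItem]) -> dict[str, float]:
    """Evaluation: compare expected and actual output and summarise the results."""
    matches = sum(1 for item in items if item.actual == item.expected)
    accuracy = matches / len(items) if items else 0.0
    return {"total": len(items), "matches": matches, "accuracy": accuracy}


def fake_chat(question: str) -> str:
    # Stand-in for GOV.UK Chat (assumption): always gives the same answer.
    return "yes"


dataset = [
    DatasetItem("Is this about tax?", "yes"),
    DatasetItem("Is this about pets?", "no"),
]
results = evaluate(generate(dataset, fake_chat))
print(results)  # {'total': 2, 'matches': 1, 'accuracy': 0.5}
```

Items that already carry actual output are passed through unchanged, which mirrors the note that a dataset may also contain actual output.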

Technical documentation

Install uv to get started.

Dependencies

Run uv sync to install dependencies.

Evaluation tasks that generate responses from GOV.UK Chat require the application to be available in your ~/govuk directory.

The means to access input data is documented in data/README.md.

AWS credentials for Bedrock judges

If you run evaluations that use Amazon Bedrock (for example the Nova judge models), refresh temporary credentials with:

./scripts/export_aws_credentials.sh

The script writes .env.aws, which is automatically loaded by the CLI on startup. Re-run it whenever the credentials expire.
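
As a rough illustration of what loading such a file on startup can involve, the sketch below reads KEY=value lines into the process environment using only the standard library. The format of .env.aws and the way the CLI actually loads it are not documented here, so the dotenv-style layout, the function names parse_env_file and load_env_file, and the sample values are all assumptions.

```python
import os
from pathlib import Path


def parse_env_file(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blanks, comments and an optional 'export ' prefix."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment, ignore it
        # Drop surrounding whitespace and simple quoting around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_env_file(path: Path) -> bool:
    """Copy variables from the file into os.environ; return False if the file is missing."""
    if not path.is_file():
        return False
    os.environ.update(parse_env_file(path.read_text()))
    return True


# Placeholder content (assumption): real files would hold real temporary credentials.
sample = (
    "# temporary credentials\n"
    "AWS_ACCESS_KEY_ID=EXAMPLEKEY\n"
    "export AWS_SESSION_TOKEN='EXAMPLETOKEN'\n"
)
parsed = parse_env_file(sample)
print(parsed)  # {'AWS_ACCESS_KEY_ID': 'EXAMPLEKEY', 'AWS_SESSION_TOKEN': 'EXAMPLETOKEN'}
```

A missing file is reported rather than raised, so a caller can tell that the credentials script has not been run yet.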

Usage

Run uv run govuk_chat_evaluation to view available evaluation tasks and options.

Development tasks

Run uv run pytest to run tests.
Run uv run ruff format to format the code.
Run uv run ruff check . to lint the codebase.
Run uv run pyright to validate the type hints.

Licence

MIT License