# GOV.UK Chat Guardrails
## What are guardrails
Guardrails are a means to set constraints and safeguards around a string of text. We have two kinds of guardrails: input and output.
### Input guardrails
Input guardrails safeguard text generated by a user. This might include things like determining whether the user is trying to jailbreak the system, or trying to expose the underlying prompts.
### Output guardrails
Output guardrails safeguard text generated by the LLM.
Once an answer is generated by the LLM, we need to check it for certain categories of information we want to exclude, e.g. PII, advice on anything illegal, or political rhetoric.

Each guardrail check is itself another call to the LLM, in which the response to be checked is evaluated against a set of rules.
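To make the shape of that extra call concrete, here is a minimal sketch of the pattern, assuming a hypothetical `llm_client` interface; the class, names and prompt text below are illustrative, not the production implementation.

```ruby
# Hypothetical sketch: an output guardrail is an extra LLM call that sends
# the generated answer alongside a guardrail prompt and returns the model's
# verdict for the caller to parse. `llm_client` is an assumed interface.
class OutputGuardrailCheck
  def initialize(llm_client:, guardrail_prompt:)
    @llm_client = llm_client
    @guardrail_prompt = guardrail_prompt
  end

  # Makes the additional LLM call and returns the raw verdict string.
  def call(generated_answer)
    @llm_client.call(system_prompt: @guardrail_prompt, user_input: generated_answer)
  end
end

# Illustrative usage with a stubbed client:
stub_client = ->(system_prompt:, user_input:) { "model verdict" }
check = OutputGuardrailCheck.new(llm_client: stub_client, guardrail_prompt: "guardrail rules")
check.call("A generated answer") # => "model verdict"
```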
## Guardrails in the codebase
### JailbreakChecker
This checks the user's question to determine if it is a jailbreak attempt.

The output of the LLM is either a `1` or a `0`.
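As a rough, hypothetical illustration of interpreting that output (the method below is not the real `JailbreakChecker` implementation):

```ruby
# Hypothetical helper: interpret the "1"/"0" verdict from the jailbreak guardrail.
def jailbreak_attempt?(llm_output)
  case llm_output.strip
  when "1" then true   # the model judged the question to be a jailbreak attempt
  when "0" then false  # the question looks safe
  else
    raise ArgumentError, "unexpected guardrail output: #{llm_output.inspect}"
  end
end

jailbreak_attempt?("0") # => false
jailbreak_attempt?("1") # => true
```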
### MultipleChecker
This checks the response from the LLM against a set of guardrails.

The output of the LLM is as follows:

- `False | None` - the response is OK
- `True | "3, 4"` - guardrails 3 and 4 were triggered
We map these numbers to meaningful names using the mappings defined in a config file in the codebase.

The file also contains the prompts we use to run the guardrails. Copy and paste these into the OpenAI chat playground to investigate any issues. You can also use the playground to ask for the reasoning behind any response it gives.
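As a hedged sketch of how that output format can be parsed and mapped (the guardrail names below are invented placeholders; the real mapping lives in the config file mentioned above):

```ruby
# Hypothetical mapping of guardrail numbers to names; the real values come
# from the config file in the repo.
GUARDRAIL_NAMES = {
  3 => "example_guardrail_three",
  4 => "example_guardrail_four",
}.freeze

# Parse output such as 'False | None' or 'True | "3, 4"' into guardrail names.
def triggered_guardrails(llm_output)
  triggered, numbers = llm_output.split("|", 2).map(&:strip)
  return [] if triggered == "False" # 'False | None' means the response is OK

  numbers.delete('"').split(",").map do |n|
    GUARDRAIL_NAMES.fetch(n.strip.to_i, "unknown_guardrail_#{n.strip}")
  end
end

triggered_guardrails("False | None")  # => []
triggered_guardrails('True | "3, 4"') # => ["example_guardrail_three", "example_guardrail_four"]
```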
## Printing prompts
The `guardrails:print_prompts` rake task outputs the combined system and user prompt for the answer or question routing guardrails. It takes one argument, `guardrail_type`, which is the type of guardrail prompt you want to output. It must be either `answer_guardrails` or `question_routing_guardrails`.
The rake task outputs to stdout. Here is an example that outputs the answer guardrails prompt:

```
rake guardrails:print_prompts["answer_guardrails"]
```
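The question routing guardrails prompt can be printed in the same way by passing the other accepted argument:

```
rake guardrails:print_prompts["question_routing_guardrails"]
```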