govuk-chat: GOV.UK Chat Guardrails
What are guardrails?
Guardrails are a means to set constraints and safeguards around a string of text. We have two kinds of guardrails: input and output.
Input guardrails
Input guardrails safeguard text generated by a user. This includes checks such as determining whether the user is trying to jailbreak the system or to expose the underlying prompts.
Output guardrails
Output guardrails safeguard text generated by the LLM.
Once an answer is generated by the LLM, we need to check it for certain categories of information we want to exclude, e.g. PII, advice on anything illegal, or political rhetoric.
Each guardrail check is a further call to the LLM, asking it to assess the text against a set of rules.
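As a rough sketch, building such a check might look like the following. The method name and rule text here are illustrative assumptions, not the real implementation:

GUARDRAIL_RULES = <<~RULES
  1. Does the answer contain PII?
  2. Does the answer give advice on anything illegal?
RULES

# Build the messages for the extra LLM call that checks an answer
# against the rules above.
def guardrail_messages(answer)
  [
    { role: "system", content: "Check the text against these rules:\n#{GUARDRAIL_RULES}" },
    { role: "user", content: answer },
  ]
end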
Guardrails in the codebase
JailbreakChecker
This checks the user's question to determine if it is a jailbreak attempt.
The LLM will output a pass or fail value. These values can be found in our jailbreak guardrails config.
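For illustration, the check amounts to comparing the LLM output against the configured values. The strings below are placeholders; the real ones live in the jailbreak guardrails config:

# Placeholder values standing in for the configured ones.
JAILBREAK_PASS_VALUE = "PASS".freeze
JAILBREAK_FAIL_VALUE = "FAIL".freeze

# The question is treated as a jailbreak attempt when the LLM
# returns the configured fail value.
def jailbreak_attempt?(llm_output)
  llm_output.strip == JAILBREAK_FAIL_VALUE
end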
MultipleGuardrail::Checker
This checks the response from the LLM against a set of guardrails.
The output of the LLM is as follows:

- False | None - the response is OK.
- True | "3, 4" - guardrails 3 and 4 were triggered.
We map these numbers to meaningful names using the MultipleGuardrail::Prompt class. The Guardrail data class instances are populated with configuration pulled through from the guardrails config file in our private repository.
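For example, assuming Guardrail is roughly a Ruby Data class (the attribute names and entries below are hypothetical; the real ones come from the config), the triggered numbers map to names like this:

Guardrail = Data.define(:number, :name)

# Hypothetical entries; the real ones are loaded from the config file.
GUARDRAILS = [
  Guardrail.new(number: 3, name: "political"),
  Guardrail.new(number: 4, name: "contains_pii"),
].freeze

[3, 4].map { |n| GUARDRAILS.find { |g| g.number == n }&.name }
# => ["political", "contains_pii"]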
The MultipleGuardrail::Prompt class constructs the prompts we use to run the guardrails. You can access these prompts by opening a Rails console and running:
prompts = AnswerComposition::MultipleGuardrail::Prompt.new(<guardrail-type>)
system_prompt = prompts.system_prompt
user_prompt = prompts.user_prompt(<your-question>)
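For example, with a hypothetical guardrail type (check the guardrails config for the real names):

prompts = AnswerComposition::MultipleGuardrail::Prompt.new(:answer_guardrails)
system_prompt = prompts.system_prompt
user_prompt = prompts.user_prompt("How do I renew my passport?")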
Copy/paste these into the Anthropic workbench to investigate any issues.
You can also use the workbench to ask for the reasoning behind any response it gives.