govuk-chat: 3. Output guardrails
Date: 2024-07-09
Context
We decided we needed a guardrail for answers generated by the LLM, following red team testing that exposed a number of problematic scenarios we did not believe we could resolve with prompt engineering.
We considered the following approaches:
- Multiple guardrail functions, i.e. multiple LLM calls:
  - This approach uses multiple calls to the chat completions API, one for each guardrail, each of which could require a different model
  - Pros: transparency and flexibility
  - Cons: rate limits, token usage, latency and higher costs
- A single guardrail function, i.e. a single LLM call:
  - One call to the chat completions API using a single prompt that combines the individual prompts from the multiple guardrail functions
  - Pros: fewer calls, so lower rate-limit pressure, token usage, response times and costs
  - Cons: less transparency and a risk of failure due to the complexity of the prompt
- Fine-tuned model:
  - Customising a language model with many question/answer examples created synthetically using the chat completions API
  - Pros: accuracy and tailored responses
  - Cons: inference incurs additional costs, and the model requires re-training and a large training set
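To illustrate the single-call approach, the sketch below combines individual guardrail instructions into one system prompt and parses a structured verdict back into guardrail names. The guardrail names, prompt wording and response format here are illustrative assumptions, not the production prompts:

```python
# Hypothetical guardrail prompts; the real prompts live in the service itself.
GUARDRAILS = {
    "personal_data": "Flag answers that reveal personal data.",
    "political": "Flag answers that express political opinions.",
    "unsafe_advice": "Flag answers giving legal, medical or financial advice.",
}

def build_combined_prompt(guardrails: dict[str, str]) -> str:
    """Combine the individual guardrail prompts into a single system prompt
    that asks the model for one structured verdict."""
    rules = "\n".join(
        f"{i}. {name}: {rule}"
        for i, (name, rule) in enumerate(guardrails.items(), start=1)
    )
    return (
        "Check the answer below against each guardrail:\n"
        f"{rules}\n"
        'Respond with True | "<comma-separated guardrail numbers>" if any '
        "guardrail is triggered, or False | None otherwise."
    )

def parse_verdict(response: str, guardrails: dict[str, str]) -> list[str]:
    """Map a 'True | "1, 3"' style model response back to guardrail names."""
    triggered, _, detail = response.partition("|")
    if triggered.strip() != "True":
        return []
    names = list(guardrails)
    return [names[int(n) - 1] for n in detail.strip().strip('"').split(",")]
```

For example, `parse_verdict('True | "1, 3"', GUARDRAILS)` returns `["personal_data", "unsafe_advice"]`, while a `False | None` response returns an empty list.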
We considered the following approaches to surfacing an error message:
- Surfacing a single message informing a user that they have failed a guardrail
- Surfacing a single message informing a user which specific guardrail(s) the generated answer had failed
Decision
In the first instance, in order to meet the live pilot deadline with sufficient time for end-to-end testing and iteration, we will implement a version of guardrails that makes a single OpenAI chat completions API call after answer generation, uses GPT-4o, and surfaces a single message indicating that the answer generated from the user's input has triggered a guardrail.
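The decided flow can be sketched as follows. The message wording and function names are assumptions for illustration; the live service uses its own copy, and `guardrail_check` stands in for the single post-generation GPT-4o call:

```python
from typing import Callable

# Illustrative copy only; the live service has its own user-facing wording.
GUARDRAIL_MESSAGE = (
    "Sorry, we cannot show this answer. Please try rephrasing your question."
)

def answer_with_guardrails(
    question: str,
    generate: Callable[[str], str],
    guardrail_check: Callable[[str], bool],
) -> str:
    """Generate an answer, then make one post-generation guardrail check.
    If any guardrail is triggered, surface a single generic message rather
    than naming the specific guardrail(s) that failed."""
    answer = generate(question)
    if guardrail_check(answer):  # the single extra chat completions call
        return GUARDRAIL_MESSAGE
    return answer
```

Keeping the guardrail check behind a simple boolean interface like this makes it straightforward to swap in a multi-call or fine-tuned implementation later without changing the answer flow.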
Rationale
Testing on a single guardrail function showed good results for accuracy, and reduced resource costs in terms of tokens, rates, and latency. These reductions are in comparison to testing performed on multiple guardrail functions, iterating on a single guardrail function, and preliminary work performed fine-tuning gpt-3.5.
For further details see the post-generation guardrails summary doc
Status
Accepted
Consequences
- Decisions have been made to meet the live pilot date
- Guardrail responses will not initially meet the level of user-centred design found in other parts of the service
- Guardrail responses will need to be iterated and accuracy assessed
- Fine-tuning a model will not be performed, and therefore no evidence-based comparison will be available
- The relative importance of order and hierarchy will not be accounted for in the initial implementation and will need to be investigated
- Impact on live will need to be assessed, with particular attention to the false positive rate, which would negatively impact user experience by restricting functionality (other types of failure would negatively impact user experience by failing to safeguard against inappropriate messages, but this is acceptable in a closed pilot)
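The false positive rate flagged above could be assessed against a labelled evaluation set along these lines. This is a minimal sketch under the assumption that each evaluation answer carries a human label saying whether it genuinely should have been blocked:

```python
def false_positive_rate(predictions: list[bool], labels: list[bool]) -> float:
    """Share of genuinely safe answers (label False) that the guardrail
    wrongly flagged (prediction True)."""
    # Keep only the predictions made on answers labelled as safe.
    flags_on_safe = [p for p, label in zip(predictions, labels) if not label]
    if not flags_on_safe:
        return 0.0
    return sum(flags_on_safe) / len(flags_on_safe)
```

For example, with predictions `[True, False, True, False]` against labels `[True, False, False, False]`, one of the three safe answers was wrongly flagged, giving a false positive rate of 1/3.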