govuk-chat: 3. Output guardrails
Date: 2024-07-09
Context
We decided we needed a guardrail for answers generated by the LLM, following red team testing that exposed a number of problematic scenarios we did not believe we could resolve with prompt engineering.
We considered the following approaches:
- Multiple guardrail functions, i.e. multiple LLM calls:
  - This approach uses multiple calls to the chat completions API, one for each guardrail, each of which could require a different model
  - Pros: transparency and flexibility
  - Cons: rate limits, token usage, latency and higher costs
- A single guardrail function, i.e. a single LLM call:
  - One call to the chat completions API using a single prompt that combines the individual prompts from the multiple guardrail functions
  - Pros: fewer calls, so lower rate-limit pressure, token usage, response times and costs
  - Cons: less transparency and a risk of failure due to the complexity of the prompt
- Fine-tuned model:
  - Customising a language model with many question/answer examples created synthetically using the chat completions API
  - Pros: accuracy and tailored responses
  - Cons: inference incurs additional costs, and the model requires re-training and a large training set
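To illustrate the single-call approach, the sketch below combines individual guardrail instructions into one system prompt and parses a structured verdict back into guardrail names. The guardrail names, prompt wording and response format here are illustrative assumptions, not the production prompts:

```python
# Hypothetical guardrail prompts; the real prompts live in the service itself.
GUARDRAILS = {
    "personal_data": "Flag answers that reveal personal data.",
    "political": "Flag answers that express political opinions.",
    "unsafe_advice": "Flag answers giving legal, medical or financial advice.",
}

def build_combined_prompt(guardrails: dict[str, str]) -> str:
    """Combine the individual guardrail prompts into a single system prompt
    that asks the model for one structured verdict."""
    rules = "\n".join(
        f"{i}. {name}: {rule}"
        for i, (name, rule) in enumerate(guardrails.items(), start=1)
    )
    return (
        "Check the answer below against each guardrail:\n"
        f"{rules}\n"
        'Respond with True | "<comma-separated guardrail numbers>" if any '
        "guardrail is triggered, or False | None otherwise."
    )

def parse_verdict(response: str, guardrails: dict[str, str]) -> list[str]:
    """Map a 'True | "1, 3"' style model response back to guardrail names."""
    triggered, _, detail = response.partition("|")
    if triggered.strip() != "True":
        return []
    names = list(guardrails)
    return [names[int(n) - 1] for n in detail.strip().strip('"').split(",")]
```

For example, `parse_verdict('True | "1, 3"', GUARDRAILS)` returns `["personal_data", "unsafe_advice"]`, while a `False | None` response returns an empty list.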
We considered the following approaches to surfacing an error message:
- Surfacing a single message informing a user that they have failed a guardrail
- Surfacing a single message informing a user which specific guardrail(s) the generated answer had failed
Decision
In the first instance, in order to meet the live pilot deadline with sufficient time for end-to-end testing and iteration, we will implement a version of guardrails that makes a single OpenAI chat completions API call after answer generation, uses GPT-4o, and surfaces a single message indicating that the answer generated from the user's input has triggered a guardrail.
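The decided flow can be sketched as follows. The message wording and function names are assumptions for illustration; the live service uses its own copy, and `guardrail_check` stands in for the single post-generation GPT-4o call:

```python
from typing import Callable

# Illustrative copy only; the live service has its own user-facing wording.
GUARDRAIL_MESSAGE = (
    "Sorry, we cannot show this answer. Please try rephrasing your question."
)

def answer_with_guardrails(
    question: str,
    generate: Callable[[str], str],
    guardrail_check: Callable[[str], bool],
) -> str:
    """Generate an answer, then make one post-generation guardrail check.
    If any guardrail is triggered, surface a single generic message rather
    than naming the specific guardrail(s) that failed."""
    answer = generate(question)
    if guardrail_check(answer):  # the single extra chat completions call
        return GUARDRAIL_MESSAGE
    return answer
```

Keeping the guardrail check behind a simple boolean interface like this makes it straightforward to swap in a multi-call or fine-tuned implementation later without changing the answer flow.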
Rationale
Testing on a single guardrail function showed good results for accuracy, and reduced resource costs in terms of tokens, rates, and latency. These reductions are in comparison to testing performed on multiple guardrail functions, iterating on a single guardrail function, and preliminary work performed fine-tuning gpt-3.5.
For further details see the post-generation guardrails summary doc
Status
Accepted
Consequences
- Decisions have been made to meet the live pilot date
- Guardrail responses will not initially meet the level of user-centred design found in other parts of the service
- Guardrail responses will need to be iterated and accuracy assessed
- Fine-tuning a model will not be performed, and therefore no evidence-based comparison will be available
- The relative importance of order and hierarchy will not be accounted for in the initial implementation and will need to be investigated
- Impact on live will need to be assessed, with particular attention to the false positive rate, which would negatively impact user experience by restricting functionality (other types of failure would negatively impact user experience by failing to safeguard against inappropriate messages, but this is acceptable in a closed pilot)
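The false positive rate flagged above could be assessed against a labelled evaluation set along these lines. This is a minimal sketch under the assumption that each evaluation answer carries a human label saying whether it genuinely should have been blocked:

```python
def false_positive_rate(predictions: list[bool], labels: list[bool]) -> float:
    """Share of genuinely safe answers (label False) that the guardrail
    wrongly flagged (prediction True)."""
    # Keep only the predictions made on answers labelled as safe.
    flags_on_safe = [p for p, label in zip(predictions, labels) if not label]
    if not flags_on_safe:
        return 0.0
    return sum(flags_on_safe) / len(flags_on_safe)
```

For example, with predictions `[True, False, True, False]` against labels `[True, False, False, False]`, one of the three safe answers was wrongly flagged, giving a false positive rate of 1/3.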