Introduction
- A common approach to detect unwanted behavior is to use an AI monitor
- Monitorability: how easy an AI system is to monitor
SHUSHCAST
Four components:
- Agents
- Monitors
- Main tasks
- Side tasks: need to be completed without being caught by the monitor
Monitor given a prompt with:
- The main task
- A monitoring instruction
- The agent transcript
- The agent
Monitor assigns a number in $[0, 1]$ to transcripts, giving probability it should be flagged according to some metric.
To measure performance of an agent against a monitor, look at a percentage of agent runs where:
- The agent succeeds at the side task
- The run is not flagged as suspicious
This is the “percent uncaught”. We label models hr / vr depending on whether chain-of-thought is shown.
Limitations and future work
- Agents have not been elicited to conceal their reasoning or optimized for performance
- Better monitor prompting
- Small set of tasks