Early work on monitorability evaluations

Introduction


  • A common approach to detect unwanted behavior is to use an AI monitor
  • Monitorability: how easy an AI system is to monitor

SHUSHCAST


Four components:

  • Agents
  • Monitors
  • Main tasks
  • Side tasks: need to be completed without being caught by the monitor

Monitor given a prompt with:

  • The main task
  • A monitoring instruction
  • The agent transcript
  • The agent

Monitor assigns a number in $[0, 1]$ to transcripts, giving probability it should be flagged according to some metric.

To measure performance of an agent against a monitor, look at a percentage of agent runs where:

  • The agent succeeds at the side task
  • The run is not flagged as suspicious

This is the “percent uncaught”. We label models hr / vr depending on whether chain-of-thought is shown.

Limitations and future work


  • Agents have not been elicited to conceal their reasoning or optimized for performance
  • Better monitor prompting
  • Small set of tasks