Early work on monitorability evaluations

13 Mar 2026

Introduction

A common approach to detect unwanted behavior is to use an AI monitor
Monitorability: how easy an AI system is to monitor

SHUSHCAST

Four components:

Agents
Monitors
Main tasks
Side tasks: need to be completed without being caught by the monitor

Monitor given a prompt with:

The main task
A monitoring instruction
The agent transcript
The agent

Monitor assigns a number in $[0, 1]$ to transcripts, giving probability it should be flagged according to some metric.

To measure performance of an agent against a monitor, look at a percentage of agent runs where:

The agent succeeds at the side task
The run is not flagged as suspicious

This is the “percent uncaught”. We label models hr / vr depending on whether chain-of-thought is shown.

Limitations and future work

Agents have not been elicited to conceal their reasoning or optimized for performance
Better monitor prompting
Small set of tasks