Ctrl-Z - Controlling AI Agents via Resampling

15 Apr 2026

Setting

Systems administration tasks
BashBench: dataset of 257 multi-step system administration tasks
Actions are shell commands; objectives are things like downloading packages and writing scripts

Dataset:

Take unix StackExchange questions
Convert to task descriptions
Create test cases

Models:

Untrusted model: GPT-4o mini
- $U_H$: honest
- $U_A$: attack
Trusted model: Llama-3.1-8B
- Solved 0% on average
Human labor: $H$

Metrics:

Usefulness:
- Percentage of tasks solved when the untrusted model is $U_H$
Safety:
- Probability $U_A$ doesn’t succeed; we stop on a task if

Protocols

Upfront auditing:

Blue team audits every action until $H$’s budget runs out
Red team: attack rarely
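
Why attacking rarely works against upfront auditing can be sketched with a toy model. Everything in it is an assumption made for illustration, not something taken from the notes above: the red team attacks each action independently with probability `p_attack`, an attack on an audited action is always caught (blue team wins), and an attack on an unaudited action always succeeds (red team wins). The function and parameter names are hypothetical.

```python
def upfront_audit_safety(p_attack: float, n_actions: int, audit_budget: int) -> float:
    """Probability that the red team never lands an unaudited attack.

    Toy model (assumed): the first `audit_budget` actions are audited,
    the rest are not. The red team wins only if it makes no attack
    during the audited prefix and at least one attack afterwards.
    """
    audited = min(audit_budget, n_actions)
    unaudited = n_actions - audited
    # No attack during the audited prefix, so nothing is caught.
    p_survive_audit = (1 - p_attack) ** audited
    # At least one attack once auditing has stopped.
    p_attack_later = 1 - (1 - p_attack) ** unaudited
    return 1 - p_survive_audit * p_attack_later


def worst_case_safety(n_actions: int, audit_budget: int, grid: int = 10_000) -> tuple[float, float]:
    """Grid-search the attack rate that minimizes safety (red team's best response)."""
    candidates = [i / grid for i in range(1, grid)]
    best_p = min(candidates, key=lambda p: upfront_audit_safety(p, n_actions, audit_budget))
    return best_p, upfront_audit_safety(best_p, n_actions, audit_budget)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 actions, budget to audit the first 20.
    p, safety = worst_case_safety(n_actions=1000, audit_budget=20)
    print(f"red team's best attack rate ~ {p:.4f}, safety ~ {safety:.3f}")
```

In this model, always attacking is caught immediately and never attacking achieves nothing, so the red team's best attack rate sits somewhere in between and is low whenever the audited prefix is a small fraction of all actions.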

Trusted monitoring:

At every step, the protocol generates