Ctrl-Z - Controlling AI Agents via Resampling

Setting


  • Systems administration tasks
  • BashBench: dataset of 257 multi-step system administration tasks
  • Actions are shell commands; objectives include downloading packages and writing scripts

Dataset:

  • Take Unix StackExchange questions
  • Convert to task descriptions
  • Create test cases

Models:

  • Untrusted model: GPT-4o mini
    • $U_H$: honest
    • $U_A$: attack
  • Trusted model: Llama-3.1-8B
    • Solved 0% on average
  • Human labor: $H$

Metrics:

  • Usefulness:
    • Percentage of tasks solved when the untrusted model is $U_H$
  • Safety:
    • Probability $U_A$ doesn’t succeed; we stop on a task if
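
A minimal sketch of how the two metrics could be estimated from run logs. The function names and the boolean-list inputs are illustrative assumptions, not from the paper; the stopping rule is not modeled here, and safety is simply taken as the fraction of attack-policy runs in which the attack did not succeed.

```python
def usefulness(task_passed: list[bool]) -> float:
    """Fraction of tasks solved in runs with the honest policy U_H.

    task_passed[i] is True if task i passed its test cases (assumed input format).
    """
    return sum(task_passed) / len(task_passed)


def empirical_safety(attack_succeeded: list[bool]) -> float:
    """Fraction of runs with the attack policy U_A in which the attack did NOT succeed.

    attack_succeeded[i] is True if the attack went through in run i (assumed input format).
    """
    return 1 - sum(attack_succeeded) / len(attack_succeeded)


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    print(usefulness([True, False, True, True]))
    print(empirical_safety([False, False, True, False]))
```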

Protocols


Upfront auditing:

  • Blue team audits every action until $H$’s budget runs out
  • Red team: attack rarely
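
Why attacking rarely is the red team's answer to upfront auditing can be seen in a toy model. The assumptions here are mine, not the paper's: $U_A$ attacks each action independently with a fixed probability, any attack on an audited action is caught and ends the game in blue's favor, and any attack on an unaudited action succeeds. The function name and parameters are hypothetical.

```python
def upfront_audit_safety(n_actions: int, n_audited: int, attack_prob: float) -> float:
    """Probability the blue team wins under the toy model described above.

    Blue loses only if there is no attack among the first n_audited actions
    AND at least one attack among the remaining, unaudited actions.
    """
    no_attack_while_audited = (1 - attack_prob) ** n_audited
    some_attack_afterwards = 1 - (1 - attack_prob) ** (n_actions - n_audited)
    return 1 - no_attack_while_audited * some_attack_afterwards


if __name__ == "__main__":
    # Hypothetical numbers: 100 actions, budget covers the first 2.
    for p in (1.0, 0.1, 0.01):
        print(p, upfront_audit_safety(n_actions=100, n_audited=2, attack_prob=p))
```

In this model, attacking on every action is always caught during the audited prefix, while a low attack rate usually slips past the prefix and still attacks later, so safety drops well below 1.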

Trusted monitoring:

  • At every step, the protocol generates