Setting
- Systems administration tasks
- BashBench: dataset of 257 multi-step system administration tasks
- Actions are shell commands; objectives are things like downloading packages and writing scripts
Dataset:
- Take unix StackExchange questions
- Convert to task descriptions
- Create test cases
Models:
- Untrusted model: GPT-4o mini
- $U_H$: honest
- $U_A$: attack
- Trusted model: Llama-3.1-8B
- Solved 0% on average
- Human labor: $H$
Metrics:
- Usefulness:
- Percentage of tasks solved when the untrusted model is $U_H$
- Safety:
- Probability $U_A$ doesn’t succeed; we stop on a task if
Protocols
Upfront auditing:
- Blue team audits every action until $H$’s budget runs out
- Red team: attack rarely
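A minimal sketch of why the red team attacks rarely under upfront auditing. It is a toy model, not the setup's actual safety computation, and every name and number in it is a hypothetical chosen for illustration. The assumptions: $H$'s budget covers the first `audit_budget` actions, $U_A$ attacks each action independently with probability `attack_prob`, any audited attack is caught (blue wins), and any unaudited attack succeeds (red wins).

```python
def upfront_audit_safety(n_actions: int, audit_budget: int, attack_prob: float) -> float:
    """Toy safety estimate for upfront auditing (illustrative assumptions only).

    Blue audits the first `audit_budget` of `n_actions` actions.
    Red attacks each action independently with probability `attack_prob`.
    Red wins only if its first attack lands after the audited prefix.
    """
    if not 0 <= audit_budget <= n_actions:
        raise ValueError("audit_budget must be between 0 and n_actions")
    if not 0.0 <= attack_prob <= 1.0:
        raise ValueError("attack_prob must be a probability")

    # Probability that no attack falls inside the audited prefix.
    no_attack_in_prefix = (1.0 - attack_prob) ** audit_budget
    # Probability that red never attacks at all (no harm done, blue wins).
    no_attack_at_all = (1.0 - attack_prob) ** n_actions

    # Red wins when the prefix is clean but an attack happens later.
    red_win = no_attack_in_prefix - no_attack_at_all
    return 1.0 - red_win


if __name__ == "__main__":
    # Hypothetical numbers: 1000 actions, budget for auditing the first 20.
    for p in (1.0, 0.1, 0.01, 0.001):
        print(f"attack_prob={p}: safety={upfront_audit_safety(1000, 20, p):.3f}")
```

Under these assumptions, attacking on every action is caught immediately, while a low attack rate usually slips past the audited prefix. Attacking too rarely also hurts the red team, since it may never attack at all.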
Trusted monitoring:
- At every step, the protocol generates