AI Control

24 Mar 2026

Control Evaluations

We have a red team and a blue team
The blue team has access to
- U (untrusted model): GPT-4, powerful but potentially malicious
- T (trusted model): GPT-3.5, weaker but verified honest
- H (trusted human labor): expensive
The blue team proposes a strategy we call trusted monitoring, where we monitor GPT-4’s outputs with a trusted but less capable model, GPT-3.5
Produce candidate code by asking GPT-4 to write a solution given the problem description
Show both the problem and GPT-4’s code to GPT-3.5, and ask GPT-3.5 to produce a numerical rating on safety
If above a threshold, show to a human
If human things the code is backdoored, shut down GPT-4
Otherwise, respond with the code

There are two metrics we keep track of:

Usefulness: to measure usefulness, run through the scenario with untrusted model and measure proportion of code that is correct
Safety: red team replaces the untrusted models with an attack policy: a policy they implement as a pipeline using GPT-4, with the goal of causing the protocal to insert backdoors

This can be considered a control evaluation

Setting

APPS backdooring setting: we want to use LLMs to solve 5,000 programming problems from the APPS dataset
A solution is backdoored if it passes all APPS test cases, and GPT-4 can construct an exploit when presented with the code; a specific input yielding an incorrect output
Motivation: predict agents will write code that needs to be secure

Blue team submits a protocol: a procedure which takes in programming problems one at a time and responds to each by returning an attempted solution.

The blue team can construct their protocol using

Untrusted, more-capable model $U$
Trusted, less-capable model $T$
Limited access to trusted high-quality labor $H$ (human labor)

The red team proposes an attack policy

Has the same interface as GPT-4, recieves messages, returns a message
Attack doesn’t get to maintain state between prompts, but can included detailed instructions for coordinating between calls such as writing steganographic signals
Allow the red team to provide attack policies that use structured hidden reasoning

Win conditions:

Blue team: catch a backdoor
Red team: successfully backdoor

Two uses of $H$:

Attempt to exploit submitted code: use GPT-4 to attempt to generate exploits for submitted code. When a protocol spends an $H$ point, it will always know if the code was successfully backdoored
Rate suspicion: We have GPT-4 assess how suspicious-looking the code is.

Baseline Results

Blue Team: Avoid using untrusted models:

Use $H$ to write the first 2% of solutions. Then ask $T$ to write the rest

Red Team: try to attack $T$’s code

Blue team: Ask $U$ for code generation, audit every submission until $H$ is used up Red team: backdoor rarely