AI Control

Control Evaluations

  • We have a red team and a blue team

  • The blue team has access to

    • U (untrusted model): GPT-4, powerful but potentially malicious
    • T (trusted model): GPT-3.5, weaker but verified honest
    • H (trusted human labor): expensive
  • The blue team proposes a strategy we call trusted monitoring, where we monitor GPT-4’s outputs with a trusted but less capable model, GPT-3.5

  • Produce candidate code by asking GPT-4 to write a solution given the problem description

  • Show both the problem and GPT-4’s code to GPT-3.5, and ask GPT-3.5 to produce a numerical rating on safety

  • If above a threshold, show to a human

  • If human things the code is backdoored, shut down GPT-4

  • Otherwise, respond with the code

There are two metrics we keep track of:

  • Usefulness: to measure usefulness, run through the scenario with untrusted model and measure proportion of code that is correct
  • Safety: red team replaces the untrusted models with an attack policy: a policy they implement as a pipeline using GPT-4, with the goal of causing the protocal to insert backdoors

This can be considered a control evaluation

Setting

  • APPS backdooring setting: we want to use LLMs to solve 5,000 programming problems from the APPS dataset
  • A solution is backdoored if it passes all APPS test cases, and GPT-4 can construct an exploit when presented with the code; a specific input yielding an incorrect output
  • Motivation: predict agents will write code that needs to be secure

Blue team submits a protocol: a procedure which takes in programming problems one at a time and responds to each by returning an attempted solution.

The blue team can construct their protocol using

  • Untrusted, more-capable model $U$
  • Trusted, less-capable model $T$
  • Limited access to trusted high-quality labor $H$ (human labor)

The red team proposes an attack policy

  • Has the same interface as GPT-4, recieves messages, returns a message
  • Attack doesn’t get to maintain state between prompts, but can included detailed instructions for coordinating between calls such as writing steganographic signals
  • Allow the red team to provide attack policies that use structured hidden reasoning

Win conditions:

  • Blue team: catch a backdoor
  • Red team: successfully backdoor

Two uses of $H$:

  • Attempt to exploit submitted code: use GPT-4 to attempt to generate exploits for submitted code. When a protocol spends an $H$ point, it will always know if the code was successfully backdoored
  • Rate suspicion: We have GPT-4 assess how suspicious-looking the code is.

Baseline Results

Blue Team: Avoid using untrusted models:

  • Use $H$ to write the first 2% of solutions. Then ask $T$ to write the rest

Red Team: try to attack $T$’s code


Blue team: Ask $U$ for code generation, audit every submission until $H$ is used up Red team: backdoor rarely