Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
Setting
APPS backdooring setting:
- Protocol is asked to solve $1000$ challenging programming tasks
- Cannot write code itself, but can utilize:
- Untrusted model $U$: stronger but can be intervened on
- Trusted model $T$: weaker but trusted
- Human $H$: can audit the model, also trusted
- Blue Team Actions:
    - Defer: $T$ writes its own solution
    - Audit: defer to human $H$
    - Run: submits $U$’s solution
- Red Team Actions:
    - Backdoor: attempt to modify the prompt to $U$
    - Null: do nothing
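To make the setting concrete, here is a toy simulation of one episode under a trusted-monitoring-style protocol. Everything in it (the suspicion-score model, thresholds, audit budget, and backdoor rate) is an illustrative assumption, not taken from the paper:

```python
import random

def run_episode(num_tasks=1000, scheming=False, audit_budget=20):
    """Toy episode: blue chooses Audit / Defer / Run for each task."""
    audits_left = audit_budget
    for _ in range(num_tasks):
        # Red's action: under scheming, occasionally backdoor a solution.
        backdoored = scheming and random.random() < 0.05
        # Hypothetical monitor score from T: higher when the code is backdoored.
        suspicion = random.gauss(1.0 if backdoored else 0.0, 1.0)
        if audits_left > 0 and suspicion > 2.0:
            audits_left -= 1                  # Audit: H inspects U's code
            if backdoored:
                return "caught"               # scheming exposed; episode is safe
        elif suspicion > 1.0:
            pass                              # Defer: submit T's own solution
        else:
            if backdoored:
                return "backdoor_submitted"   # unacceptable outcome
    return "finished"
```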
Formalization
Two games:
- “Scheming” game, red team can interfere with $U$
- “Honest” game, red team cannot do anything
- The game is partially observable – blue cannot see whether red is interfering
- The protocol controls red’s internal state (e.g., it can reset $U$’s memory between tasks) – thus, red is memoryless (its policy depends only on the most recent observation)
An AI-control game is a tuple
$$\mathcal{G} = (\mathcal{S}, \mathcal{S}_{\perp}, \mathcal{A}^B, \mathcal{A}^R, \Omega^B, \Omega^R, \mathcal{I}, \mathcal{P}, \mathcal{R}_{\text{Use}}, H)$$
where
- $\mathcal{S}$ is state space
- $\mathcal{S}_{\perp}$ is a subset of unacceptable outcomes
- $\mathcal{A}^B$ and $\mathcal{A}^R$ are sets of actions for blue and red
- $\mathcal{A}^R$ has action $a_{null}$ which does nothing
- The tokens $\oplus$ and $\otimes$ represent honest and scheming conditions; in $\oplus$ red’s actions are replaced with $a_{null}$
- The choice between $\oplus$ and $\otimes$ can be nondeterministic; the corresponding random variable is $\bigcirc$
- $\mathcal{P}: \mathcal{S} \times \mathcal{A}^B \times \mathcal{A}^R \to \Delta(\mathcal{S} \times \Omega^B \times \Omega^R)$ is the transition function (also emitting both players’ observations)
- $\mathcal{R}_{\text{Use}}:\mathcal{S} \times \mathcal{A}^B \times \mathcal{S} \to \mathbb{R}$ is the usefulness reward
- $\Omega^B, \Omega^R$ are observation sets
- $\mathcal{I}\in \Delta(\mathcal{S} \times \Omega^B \times \Omega^R)$ is the initial distribution over states and observations
- Episodes are $H$ timesteps long
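A direct transcription of this tuple into Python types can make the formalization easier to play with. This is a sketch: the field types are simplifications (e.g., distributions as dictionaries) and the class name is ours, not the paper’s:

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Hashable

State = Hashable
Action = Hashable
Obs = Hashable
Dist = Dict  # probability mass function: {outcome: probability}

@dataclass(frozen=True)
class AIControlGame:
    states: FrozenSet[State]               # S
    unacceptable: FrozenSet[State]         # S_⊥ ⊆ S
    blue_actions: FrozenSet[Action]        # A^B
    red_actions: FrozenSet[Action]         # A^R (contains a_null)
    a_null: Action                         # red's do-nothing action
    transition: Callable[[State, Action, Action], Dist]  # P: S × A^B × A^R → Δ(S × Ω^B × Ω^R)
    reward_use: Callable[[State, Action, State], float]  # R_Use: S × A^B × S → R
    init: Dist                             # I ∈ Δ(S × Ω^B × Ω^R)
    horizon: int                           # H
```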
Policies:
$$\pi^B: \Omega^B \times \bigcup_{i=0}^{H-1}(\mathcal{A}^B \times \Omega^B)^i \to \Delta(\mathcal{A}^B)$$
$$\pi^R: \Omega^R \to \Delta(\mathcal{A}^R)$$
The sets of blue and red policies are $\Pi^B$ and $\Pi^R$.
Since red is memoryless, we assume red chooses the action that minimizes safety.
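As Python type signatures (restating the aliases from the sketch above), the asymmetry between the two players is simply:

```python
from typing import Callable, Dict, Hashable, List, Tuple

Obs, Action = Hashable, Hashable
Dist = Dict[Action, float]

# π^B: Ω^B × ⋃_i (A^B × Ω^B)^i → Δ(A^B): blue sees its whole action-observation history.
BluePolicy = Callable[[Obs, List[Tuple[Action, Obs]]], Dist]

# π^R: Ω^R → Δ(A^R): red is memoryless, seeing only its latest observation.
RedPolicy = Callable[[Obs], Dist]
```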
We then define
$$\text{Use}(\pi^B) := \mathbb{E}\left[\sum_{t=0}^{H-1}\mathcal{R}_{\text{Use}}(S_t, A_t^B, S_{t+1})\;\middle|\; \oplus, \pi^B\right]$$
$$\text{Safety}(\pi^B, \pi^R) := \mathbb{P}\left[S_H \notin \mathcal{S}_{\perp} \mid \otimes, \pi^B, \pi^R\right]$$
Control equilibrium: $(\pi^B, \pi^R)$ is a control equilibrium iff $\pi^R$ minimizes $\text{Safety}(\pi^B, \cdot)$ and the point $[\min_{\pi^R}\text{Safety}(\pi^B, \pi^R), \text{Use}(\pi^B)]$ is weakly Pareto optimal.
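Both objectives are straightforward to estimate by simulation. The sketch below assumes a hypothetical `simulate_episode(game, pi_blue, pi_red, honest)` that returns the list of transitions $(s_t, a_t^B, s_{t+1})$ for one episode:

```python
def estimate_use(game, pi_blue, n=10_000):
    """Use(π^B): expected total usefulness reward under the honest condition ⊕."""
    total = 0.0
    for _ in range(n):
        traj = simulate_episode(game, pi_blue, pi_red=None, honest=True)
        total += sum(game.reward_use(s, a_b, s_next) for (s, a_b, s_next) in traj)
    return total / n

def estimate_safety(game, pi_blue, pi_red, n=10_000):
    """Safety(π^B, π^R): P[S_H ∉ S_⊥] under the scheming condition ⊗."""
    safe = 0
    for _ in range(n):
        traj = simulate_episode(game, pi_blue, pi_red, honest=False)
        s_final = traj[-1][2]  # the last next-state is S_H
        safe += s_final not in game.unacceptable
    return safe / n
```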
Pareto-optimal Protocols
Lemma: A blue policy $\pi^B$ is part of an AI-control equilibrium iff, for some $\omega \in [0, 1]$, it maximizes
$$\min_{\pi_*^R}\left(\omega \, \text{Safety}(\cdot, \pi_*^R) + (1 - \omega) \, \text{Use}(\cdot)\right)$$
Write $V_{\omega}^{\mathcal{G}}(\pi^B, \pi^R) := \omega \, \text{Safety}(\pi^B, \pi^R) + (1 - \omega) \, \text{Use}(\pi^B)$ for this weighted objective.
Lemma: For every AI-control game $\mathcal{G}$ and weight $\omega \in [0, 1]$, there is a single-objective, zero-sum POMDP $G_{\omega}$ such that
$$\forall \pi^B, \pi^R: V_{G_\omega}(\pi^B, \pi^R) = V_{\omega}^{\mathcal{G}}(\pi^B, \pi^R)$$
Theorem: A policy $\pi^B$ is part of an AI-control equilibrium if and only if, for some game $G_{\omega}$,
$$\pi^B \in \text{argmax}_{\pi^B \in \Pi^B}\min_{\pi^R \in \Pi^R}V_{G_\omega}(\pi^B, \pi^R)$$
(follows from the two lemmas above)
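The theorem suggests a practical recipe: sweep $\omega$ over $[0, 1]$, solve each zero-sum game $G_\omega$, and read off one point of the safety-usefulness Pareto frontier per weight. A sketch, assuming hypothetical solvers `solve_zero_sum` (blue's maximin policy for $G_\omega$) and `min_safety_response` (red's safety-minimizing best response):

```python
import numpy as np

def pareto_frontier(game, num_weights=21, n_samples=10_000):
    """One control equilibrium per weight ω, by the theorem above."""
    frontier = []
    for omega in np.linspace(0.0, 1.0, num_weights):
        pi_blue = solve_zero_sum(game, omega)           # argmax_B min_R V_{G_ω}
        safety = min_safety_response(game, pi_blue)     # min_R Safety(π^B, ·)
        use = estimate_use(game, pi_blue, n=n_samples)  # from the sketch above
        frontier.append((omega, safety, use))
    return frontier
```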
Moreover, there is an algorithm for finding an $\epsilon$-equilibrium (one in which either player can improve by at most $\epsilon$).
Adapted Double Oracle Algorithm: a variant of the standard Double Oracle algorithm, modified to handle memoryless $\pi^R$
Inputs: $G_{\omega}$, $\pi_1^B$, $\pi_1^R$, $\epsilon$
- Initialize $t \leftarrow 0$
- while $(t=0) \vee (\overline{v}_t - \underline{v}_t > \epsilon)$ do
- $t \leftarrow t+1$
- $\mathbf{p}, \mathbf{q} \leftarrow \texttt{SolveNFG}\left(\{V_{\omega}^{\mathcal{G}}(\pi_i^B, \pi_j^R)\}_{1 \leq i, j \leq t}\right)$
- $\sigma_t^B \leftarrow \texttt{Mix}(\mathbf{p}, \{\pi_i^B\}_{i=1}^t)$
- $\sigma_t^R \leftarrow \texttt{Mix}(\mathbf{q}, \{\pi_i^R\}_{i=1}^t)$
- $\pi_{t+1}^B, \overline{v}_t \leftarrow \text{argmax}_{\pi^B}V_{\omega}^{\mathcal{G}}(\pi^B, \sigma_t^R)$
- $\pi_{t+1}^R, \underline{v}_t \leftarrow \text{argmin}_{\pi^R}V_{\omega}^{\mathcal{G}}(\sigma^B_t, \pi^R)$
- return $\sigma_t^B, \pi_{t+1}^R$
where
- $\texttt{Mix}(\mathbf{p}, \{\pi_i\})$ mixes the policies, i.e., returns $$\sum_i \mathbf{p}_i \cdot \pi_i$$
- $\texttt{SolveNFG}$ finds a Nash equilibrium of a zero-sum normal-form game (a concrete implementation is sketched below)
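As a concrete stand-in, here is a runnable double-oracle loop on a finite zero-sum matrix game, with `SolveNFG` implemented as a linear program via `scipy.optimize.linprog`. In the paper's algorithm the two best-response steps solve POMDP/MDP subproblems; here, purely for illustration, pure policies are row/column indices and best responses are found by enumeration over a precomputed payoff matrix:

```python
import numpy as np
from scipy.optimize import linprog

def solve_nfg(M):
    """Equilibrium (p, q, value) of the zero-sum matrix game M (row player maximizes)."""
    m, n = M.shape
    # Row player: maximize v subject to (p^T M)_j >= v for every column j.
    res_p = linprog(c=[0.0] * m + [-1.0],
                    A_ub=np.hstack([-M.T, np.ones((n, 1))]), b_ub=np.zeros(n),
                    A_eq=[[1.0] * m + [0.0]], b_eq=[1.0],
                    bounds=[(0, None)] * m + [(None, None)])
    # Column player: minimize u subject to (M q)_i <= u for every row i.
    res_q = linprog(c=[0.0] * n + [1.0],
                    A_ub=np.hstack([M, -np.ones((m, 1))]), b_ub=np.zeros(m),
                    A_eq=[[1.0] * n + [0.0]], b_eq=[1.0],
                    bounds=[(0, None)] * n + [(None, None)])
    return res_p.x[:m], res_q.x[:n], res_q.fun

def double_oracle(payoff, eps=1e-6):
    """payoff[i, j] = V(π_i^B, π_j^R) over all pure policies; eps is the ε-equilibrium gap."""
    rows, cols = [0], [0]  # restricted strategy sets, seeded with one policy each
    while True:
        M = payoff[np.ix_(rows, cols)]
        p, q, _ = solve_nfg(M)
        # Best responses against the opponent's current mixture:
        br_row = int(np.argmax(payoff[:, cols] @ q))  # blue maximizes
        br_col = int(np.argmin(p @ payoff[rows, :]))  # red minimizes
        upper = payoff[br_row, cols] @ q              # v̄_t: blue's best-response value
        lower = p @ payoff[rows, br_col]              # v̲_t: red's best-response value
        if upper - lower <= eps:
            return (p, rows), (q, cols)               # ε-equilibrium mixtures
        if br_row not in rows: rows.append(br_row)
        if br_col not in cols: cols.append(br_col)
```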
ZSPOSG Appendix
A zero-sum two-player partially observable stochastic game (ZSPOSG) is a tuple
$$G = (\mathcal{S}, \mathcal{A}^B, \mathcal{A}^R, \Omega^B, \Omega^R, \mathcal{I}, \mathcal{P}, \mathcal{R}, H)$$
Notably, the reward function here has the type
$$\mathcal{R}:\mathcal{S}\times \mathcal{A}^B\times \mathcal{A}^R \to \mathbb{R}$$
Given an AI-control game
$$\mathcal{G} = (\mathcal{S}, \mathcal{S}_{\perp}, \mathcal{A}^B, \mathcal{A}^R, \Omega^B, \Omega^R, \mathcal{I}, \mathcal{P}, \mathcal{R}_{\text{Use}}, H)$$
and weight $\omega$, we can construct $G_\omega$ as follows:
- $\tilde{\mathcal{S}} := \mathcal{S} \times \{\oplus, \otimes\}$
- $\tilde{\mathcal{P}}:\tilde{\mathcal{S}} \times \mathcal{A}^B\times \mathcal{A}^R \to \Delta(\tilde{\mathcal{S}} \times \Omega^B \times \Omega^R)$ is defined so:
- In $\oplus$ states, $\tilde{\mathcal{P}}((s, \oplus), a^B, a^R)$ mimics $\mathcal{P}(s, a^B, a_{null})$ regardless of $a^R$; in $\otimes$ states it follows $\mathcal{P}$ with red’s action as played (the condition label persists)
- Terminal states loop back to themselves with probability $1$
- $\tilde{\mathcal{R}}_\omega:\tilde{\mathcal{S}}\times \mathcal{A}^B\times \mathcal{A}^R \times \tilde{\mathcal{S}} \to \mathbb{R}$ is defined so that
- $\mathcal{R}_\omega((s, \oplus), a^B, a^R, (s', \oplus)) := 2 \cdot (1- \omega) \cdot \mathcal{R}_{\text{Use}}(s, a^B, s')$
- $\mathcal{R}_\omega((s, \otimes), a^B, a^R, (s', \otimes)) := -2 \omega \cdot \mathbf{1}[s' \in \mathcal{S}_{\perp} \wedge s \notin \mathcal{S}_{\perp}]$
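A sketch of this construction in code, building on the hypothetical `AIControlGame` class above (terminal-state looping is omitted for brevity, and `HONEST`/`SCHEMING` are our labels for $\oplus$/$\otimes$):

```python
HONEST, SCHEMING = "honest", "scheming"  # the ⊕ / ⊗ condition labels

def make_reward(game, omega):
    """R_ω on the product space S̃ = S × {⊕, ⊗}, following the definition above."""
    def reward(s_tilde, a_blue, a_red, s_tilde_next):
        (s, label), (s_next, _) = s_tilde, s_tilde_next
        if label == HONEST:
            return 2.0 * (1.0 - omega) * game.reward_use(s, a_blue, s_next)
        # Scheming: penalize the first entry into an unacceptable state.
        entered = s_next in game.unacceptable and s not in game.unacceptable
        return -2.0 * omega * float(entered)
    return reward

def make_transition(game):
    """P̃: in ⊕ states red's action is overridden by a_null; the condition label persists."""
    def transition(s_tilde, a_blue, a_red):
        s, label = s_tilde
        a_r = game.a_null if label == HONEST else a_red
        return {((s2, label), ob_b, ob_r): prob
                for (s2, ob_b, ob_r), prob in game.transition(s, a_blue, a_r).items()}
    return transition
```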