Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
Setting
APPS backdooring setting:
- Protocol is asked to solve $1000$ challenging programming tasks
- Cannot write code itself, but can utilize:
- Untrusted model $U$: stronger but can be intervened on
- Trusted model $T$: weaker but trusted
- Human $H$: can audit the model, also trusted
- Blue Team Actions:
    - Defer: $T$ writes its own solution
    - Audit: defer to human $H$
    - Run: submits $U$’s solution
- Red Team Actions:
    - Backdoor: attempt to modify the prompt to $U$
    - Null: do nothing
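To make the setting concrete, here is a toy simulation of one episode under a trusted-monitoring-style protocol. Everything in it (the suspicion-score model, thresholds, audit budget, and backdoor rate) is an illustrative assumption, not taken from the paper:

```python
import random

def run_episode(num_tasks=1000, scheming=False, audit_budget=20):
    """Toy episode: blue chooses Audit / Defer / Run for each task."""
    audits_left = audit_budget
    for _ in range(num_tasks):
        # Red's action: under scheming, occasionally backdoor a solution.
        backdoored = scheming and random.random() < 0.05
        # Hypothetical monitor score from T: higher when the code is backdoored.
        suspicion = random.gauss(1.0 if backdoored else 0.0, 1.0)
        if audits_left > 0 and suspicion > 2.0:
            audits_left -= 1                  # Audit: H inspects U's code
            if backdoored:
                return "caught"               # scheming exposed; episode is safe
        elif suspicion > 1.0:
            pass                              # Defer: submit T's own solution
        else:
            if backdoored:
                return "backdoor_submitted"   # unacceptable outcome
    return "finished"
```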
Formalization
Two games:
- “Scheming” game, red team can interfere with $U$
- “Honest” game, red team cannot do anything
- The game is partially observable – blue cannot see whether red is interfering
- The protocol controls red’s internal state (e.g., it can reset $U$’s memory between tasks) – thus, red is memoryless (its policy depends only on the most recent observation)
An AI-control game is a tuple
$$\mathcal{G} = (\mathcal{S}, \mathcal{S}_{\perp}, \mathcal{A}^B, \mathcal{A}^R, \Omega^B, \Omega^R, \mathcal{I}, \mathcal{P}, \mathcal{R}_{\text{Use}}, H)$$
where
- $\mathcal{S}$ is state space
- $\mathcal{S}_{\perp}$ is a subset of unacceptable outcomes
- $\mathcal{A}^B$ and $\mathcal{A}^R$ are sets of actions for blue and red
- $\mathcal{A}^R$ has action $a_{null}$ which does nothing
- The tokens $\oplus$ and $\otimes$ represent honest and scheming conditions; in $\oplus$ red’s actions are replaced with $a_{null}$
- The choice between $\oplus$ and $\otimes$ can be nondeterministic; the corresponding random variable is $\bigcirc$
- $\mathcal{P}: \mathcal{S} \times \mathcal{A}^B \times \mathcal{A}^R \to \Delta(\mathcal{S} \times \Omega^B \times \Omega^R)$ is the transition function (also emitting both players’ observations)
- $\mathcal{R}_{\text{Use}}:\mathcal{S} \times \mathcal{A}^B \times \mathcal{S} \to \mathbb{R}$ is the usefulness reward
- $\Omega^B, \Omega^R$ are observation sets
- $\mathcal{I}\in \Delta(\mathcal{S} \times \Omega^B \times \Omega^R)$ is the initial distribution over states and observations
- Episodes are $H$ timesteps long
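A direct transcription of this tuple into Python types can make the formalization easier to play with. This is a sketch: the field types are simplifications (e.g., distributions as dictionaries) and the class name is ours, not the paper’s:

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Hashable

State = Hashable
Action = Hashable
Obs = Hashable
Dist = Dict  # probability mass function: {outcome: probability}

@dataclass(frozen=True)
class AIControlGame:
    states: FrozenSet[State]               # S
    unacceptable: FrozenSet[State]         # S_⊥ ⊆ S
    blue_actions: FrozenSet[Action]        # A^B
    red_actions: FrozenSet[Action]         # A^R (contains a_null)
    a_null: Action                         # red's do-nothing action
    transition: Callable[[State, Action, Action], Dist]  # P: S × A^B × A^R → Δ(S × Ω^B × Ω^R)
    reward_use: Callable[[State, Action, State], float]  # R_Use: S × A^B × S → R
    init: Dist                             # I ∈ Δ(S × Ω^B × Ω^R)
    horizon: int                           # H
```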
Policies:
$$\pi^B: \Omega^B \times \bigcup_{i=0}^{H-1}(\mathcal{A}^B \times \Omega^B)^i \to \Delta(\mathcal{A}^B)$$
$$\pi^R: \Omega^R \to \Delta(\mathcal{A}^R)$$
The sets of blue and red policies are $\Pi^B$ and $\Pi^R$.
Since red is memoryless, we assume red chooses the action that minimizes safety.
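As Python type signatures (restating the aliases from the sketch above), the asymmetry between the two players is simply:

```python
from typing import Callable, Dict, Hashable, List, Tuple

Obs, Action = Hashable, Hashable
Dist = Dict[Action, float]

# π^B: Ω^B × ⋃_i (A^B × Ω^B)^i → Δ(A^B): blue sees its whole action-observation history.
BluePolicy = Callable[[Obs, List[Tuple[Action, Obs]]], Dist]

# π^R: Ω^R → Δ(A^R): red is memoryless, seeing only its latest observation.
RedPolicy = Callable[[Obs], Dist]
```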
We then define
$$\text{Use}(\pi^B) := \mathbb{E}\left[\sum_{t=0}^{H-1}\mathcal{R}_{\text{Use}}(S_t, A_t^B, S_{t+1})\;\middle|\; \oplus, \pi^B\right]$$
$$\text{Safety}(\pi^B, \pi^R) := \mathbb{P}\left[S_H \notin \mathcal{S}_{\perp} \mid \otimes, \pi^B, \pi^R\right]$$
Control equilibrium: $(\pi^B, \pi^R)$ is a control equilibrium iff $\pi^R$ minimizes $\text{Safety}(\pi^B, \cdot)$ and the point $[\min_{\pi^R}\text{Safety}(\pi^B, \pi^R), \text{Use}(\pi^B)]$ is weakly Pareto optimal.
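Both objectives are straightforward to estimate by simulation. The sketch below assumes a hypothetical `simulate_episode(game, pi_blue, pi_red, honest)` that returns the list of transitions $(s_t, a_t^B, s_{t+1})$ for one episode:

```python
def estimate_use(game, pi_blue, n=10_000):
    """Use(π^B): expected total usefulness reward under the honest condition ⊕."""
    total = 0.0
    for _ in range(n):
        traj = simulate_episode(game, pi_blue, pi_red=None, honest=True)
        total += sum(game.reward_use(s, a_b, s_next) for (s, a_b, s_next) in traj)
    return total / n

def estimate_safety(game, pi_blue, pi_red, n=10_000):
    """Safety(π^B, π^R): P[S_H ∉ S_⊥] under the scheming condition ⊗."""
    safe = 0
    for _ in range(n):
        traj = simulate_episode(game, pi_blue, pi_red, honest=False)
        s_final = traj[-1][2]  # the last next-state is S_H
        safe += s_final not in game.unacceptable
    return safe / n
```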
Pareto-optimal Protocols
Lemma: A blue policy $\pi^B$ is part of an AI-control equilibrium iff, for some $\omega \in [0, 1]$, it maximizes
$$\min_{\pi_*^R}\left(\omega \, \text{Safety}(\cdot, \pi_*^R) + (1 - \omega) \, \text{Use}(\cdot)\right)$$
Write $V_{\omega}^{\mathcal{G}}(\pi^B, \pi^R) := \omega \, \text{Safety}(\pi^B, \pi^R) + (1 - \omega) \, \text{Use}(\pi^B)$ for this weighted objective.
Lemma: For every AI-control game $\mathcal{G}$ and weight $\omega \in [0, 1]$, there is a single-objective, zero-sum POMDP $G_{\omega}$ such that
$$\forall \pi^B, \pi^R: V_{G_\omega}(\pi^B, \pi^R) = V_{\omega}^{\mathcal{G}}(\pi^B, \pi^R)$$
Theorem: A policy $\pi^B$ is part of an AI-control equilibrium if and only if, for some game $G_{\omega}$,
$$\pi^B \in \text{argmax}_{\pi^B \in \Pi^B}\min_{\pi^R \in \Pi^R}V_{G_\omega}(\pi^B, \pi^R)$$
(follows from the two lemmas above)
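The theorem suggests a practical recipe: sweep $\omega$ over $[0, 1]$, solve each zero-sum game $G_\omega$, and read off one point of the safety-usefulness Pareto frontier per weight. A sketch, assuming hypothetical solvers `solve_zero_sum` (blue's maximin policy for $G_\omega$) and `min_safety_response` (red's safety-minimizing best response):

```python
import numpy as np

def pareto_frontier(game, num_weights=21, n_samples=10_000):
    """One control equilibrium per weight ω, by the theorem above."""
    frontier = []
    for omega in np.linspace(0.0, 1.0, num_weights):
        pi_blue = solve_zero_sum(game, omega)           # argmax_B min_R V_{G_ω}
        safety = min_safety_response(game, pi_blue)     # min_R Safety(π^B, ·)
        use = estimate_use(game, pi_blue, n=n_samples)  # from the sketch above
        frontier.append((omega, safety, use))
    return frontier
```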
Moreover, there is an algorithm for finding an $\epsilon$-equilibrium (one in which either player can improve by at most $\epsilon$).
Adapted Double Oracle Algorithm: a variant of the standard Double Oracle algorithm, modified to handle memoryless $\pi^R$
Inputs: $G_{\omega}$, $\pi_1^B$, $\pi_1^R$, $\epsilon$
- Initialize $t \leftarrow 0$
- while $(t=0) \vee (\overline{v}_t - \underline{v}_t > \epsilon)$ do
- $t \leftarrow t+1$
- $\mathbf{p}, \mathbf{q} \leftarrow \texttt{SolveNFG}\left(\{V_{\omega}^{\mathcal{G}}(\pi_i^B, \pi_j^R)\}_{1 \leq i, j \leq t}\right)$
- $\sigma_t^B \leftarrow \texttt{Mix}(\mathbf{p}, \{\pi_i^B\}_{i=1}^t)$
- $\sigma_t^R \leftarrow \texttt{Mix}(\mathbf{q}, \{\pi_i^R\}_{i=1}^t)$
- $\pi_{t+1}^B, \overline{v}_t \leftarrow \text{argmax}_{\pi^B}V_{\omega}^{\mathcal{G}}(\pi^B, \sigma_t^R)$
- $\pi_{t+1}^R, \underline{v}_t \leftarrow \text{argmin}_{\pi^R}V_{\omega}^{\mathcal{G}}(\sigma^B_t, \pi^R)$
- return $\sigma_t^B, \pi_{t+1}^R$
where
- $\texttt{Mix}(\mathbf{p}, \{\pi_i\})$ mixes the policies, i.e., returns $$\sum_i \mathbf{p}_i \cdot \pi_i$$
- $\texttt{SolveNFG}$ finds a Nash equilibrium of a zero-sum normal-form game (a concrete implementation is sketched below)
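As a concrete stand-in, here is a runnable double-oracle loop on a finite zero-sum matrix game, with `SolveNFG` implemented as a linear program via `scipy.optimize.linprog`. In the paper's algorithm the two best-response steps solve POMDP/MDP subproblems; here, purely for illustration, pure policies are row/column indices and best responses are found by enumeration over a precomputed payoff matrix:

```python
import numpy as np
from scipy.optimize import linprog

def solve_nfg(M):
    """Equilibrium (p, q, value) of the zero-sum matrix game M (row player maximizes)."""
    m, n = M.shape
    # Row player: maximize v subject to (p^T M)_j >= v for every column j.
    res_p = linprog(c=[0.0] * m + [-1.0],
                    A_ub=np.hstack([-M.T, np.ones((n, 1))]), b_ub=np.zeros(n),
                    A_eq=[[1.0] * m + [0.0]], b_eq=[1.0],
                    bounds=[(0, None)] * m + [(None, None)])
    # Column player: minimize u subject to (M q)_i <= u for every row i.
    res_q = linprog(c=[0.0] * n + [1.0],
                    A_ub=np.hstack([M, -np.ones((m, 1))]), b_ub=np.zeros(m),
                    A_eq=[[1.0] * n + [0.0]], b_eq=[1.0],
                    bounds=[(0, None)] * n + [(None, None)])
    return res_p.x[:m], res_q.x[:n], res_q.fun

def double_oracle(payoff, eps=1e-6):
    """payoff[i, j] = V(π_i^B, π_j^R) over all pure policies; eps is the ε-equilibrium gap."""
    rows, cols = [0], [0]  # restricted strategy sets, seeded with one policy each
    while True:
        M = payoff[np.ix_(rows, cols)]
        p, q, _ = solve_nfg(M)
        # Best responses against the opponent's current mixture:
        br_row = int(np.argmax(payoff[:, cols] @ q))  # blue maximizes
        br_col = int(np.argmin(p @ payoff[rows, :]))  # red minimizes
        upper = payoff[br_row, cols] @ q              # v̄_t: blue's best-response value
        lower = p @ payoff[rows, br_col]              # v̲_t: red's best-response value
        if upper - lower <= eps:
            return (p, rows), (q, cols)               # ε-equilibrium mixtures
        if br_row not in rows: rows.append(br_row)
        if br_col not in cols: cols.append(br_col)
```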
ZSPOSG Appendix
A zero-sum two-player partially observable stochastic game (ZSPOSG) is a tuple
$$G = (\mathcal{S}, \mathcal{A}^B, \mathcal{A}^R, \Omega^B, \Omega^R, \mathcal{I}, \mathcal{P}, \mathcal{R}, H)$$
Notably, the reward function here has the type
$$\mathcal{R}:\mathcal{S}\times \mathcal{A}^B\times \mathcal{A}^R \to \mathbb{R}$$
Given an AI-control game
$$\mathcal{G} = (\mathcal{S}, \mathcal{S}_{\perp}, \mathcal{A}^B, \mathcal{A}^R, \Omega^B, \Omega^R, \mathcal{I}, \mathcal{P}, \mathcal{R}_{\text{Use}}, H)$$
and weight $\omega$, we can construct $G_\omega$ as follows:
- $\tilde{\mathcal{S}} := \mathcal{S} \times \{\oplus, \otimes\}$
- $\tilde{\mathcal{P}}:\tilde{\mathcal{S}} \times \mathcal{A}^B\times \mathcal{A}^R \to \Delta(\tilde{\mathcal{S}} \times \Omega^B \times \Omega^R)$ is defined so:
- In $\oplus$ states, $\tilde{\mathcal{P}}((s, \oplus), a^B, a^R)$ mimics $\mathcal{P}(s, a^B, a_{null})$ regardless of $a^R$; in $\otimes$ states it follows $\mathcal{P}$ with red’s action as played (the condition label persists)
- Terminal states loop back to themselves with probability $1$
- $\tilde{\mathcal{R}}_\omega:\tilde{\mathcal{S}}\times \mathcal{A}^B\times \mathcal{A}^R \times \tilde{\mathcal{S}} \to \mathbb{R}$ is defined so that
- $\mathcal{R}_\omega((s, \oplus), a^B, a^R, (s', \oplus)) := 2 \cdot (1- \omega) \cdot \mathcal{R}_{\text{Use}}(s, a^B, s')$
- $\mathcal{R}_\omega((s, \otimes), a^B, a^R, (s', \otimes)) := -2 \omega \cdot \mathbf{1}[s' \in \mathcal{S}_{\perp} \wedge s \notin \mathcal{S}_{\perp}]$
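A sketch of this construction in code, building on the hypothetical `AIControlGame` class above (terminal-state looping is omitted for brevity, and `HONEST`/`SCHEMING` are our labels for $\oplus$/$\otimes$):

```python
HONEST, SCHEMING = "honest", "scheming"  # the ⊕ / ⊗ condition labels

def make_reward(game, omega):
    """R_ω on the product space S̃ = S × {⊕, ⊗}, following the definition above."""
    def reward(s_tilde, a_blue, a_red, s_tilde_next):
        (s, label), (s_next, _) = s_tilde, s_tilde_next
        if label == HONEST:
            return 2.0 * (1.0 - omega) * game.reward_use(s, a_blue, s_next)
        # Scheming: penalize the first entry into an unacceptable state.
        entered = s_next in game.unacceptable and s not in game.unacceptable
        return -2.0 * omega * float(entered)
    return reward

def make_transition(game):
    """P̃: in ⊕ states red's action is overridden by a_null; the condition label persists."""
    def transition(s_tilde, a_blue, a_red):
        s, label = s_tilde
        a_r = game.a_null if label == HONEST else a_red
        return {((s2, label), ob_b, ob_r): prob
                for (s2, ob_b, ob_r), prob in game.transition(s, a_blue, a_r).items()}
    return transition
```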