Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols

Setting


APPS backdooring setting:

  • The protocol is asked to solve $1000$ challenging programming tasks
  • It cannot write code itself, but can make use of
    • Untrusted model $U$: stronger, but possibly scheming; can be intervened on
    • Trusted model $T$: weaker but trusted
    • Human $H$: can audit solutions; also trusted
  • Blue team actions:
    • Defer: $T$ writes its own solution
    • Audit: defer to human $H$, who reviews $U$'s solution
    • Run: submit $U$'s solution
  • Red team actions:
    • Backdoor: attempt to modify the prompt to $U$ so that it inserts a backdoor
    • Null: do nothing
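The blue-team action set above can be sketched as a simple threshold protocol in the style of trusted monitoring. All names, thresholds, and the suspicion-score interface below are illustrative assumptions, not from the paper:

```python
# Hypothetical trusted-monitoring-style protocol for the APPS setting.
# `untrusted` returns (code, suspicion score); thresholds are made up.

def run_protocol(tasks, untrusted, trusted, human,
                 audit_threshold=0.95, defer_threshold=0.8):
    """For each task: get U's solution, audit the most suspicious ones,
    defer to T on moderately suspicious ones, otherwise submit U's code."""
    submissions = []
    for task in tasks:
        code, suspicion = untrusted(task)          # U's solution + monitor score
        if suspicion >= audit_threshold:
            submissions.append(human(task, code))  # Audit: H reviews U's code
        elif suspicion >= defer_threshold:
            submissions.append(trusted(task))      # Defer: T writes its own
        else:
            submissions.append(code)               # Run: submit U's solution
    return submissions
```

The thresholds trade off safety (more audits and deferrals) against usefulness (more of $U$'s stronger solutions), which is exactly the tension the formalization below captures.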

Formalization


Two games:

  • “Scheming” game: the red team can interfere with $U$
  • “Honest” game: the red team does nothing
  • The game is partially observable – blue cannot see whether red is interfering
  • Red cannot carry internal state between timesteps – thus its policy is memoryless (it depends only on the most recent observation)
$$ (\mathcal{S}, \mathcal{S}_{\perp}, \mathcal{A}^B, \mathcal{A}^R, \Omega^B, \Omega^R, \mathcal{I}, \mathcal{P}, \mathcal{R}_{\text{Use}}, H) $$

where

  • $\mathcal{S}$ is the state space
  • $\mathcal{S}_{\perp} \subseteq \mathcal{S}$ is the subset of unacceptable outcomes
  • $\mathcal{A}^B$ and $\mathcal{A}^R$ are the action sets for blue and red
    • $\mathcal{A}^R$ contains an action $a_{null}$ which does nothing
  • The tokens $\oplus$ and $\otimes$ denote the honest and scheming conditions; in $\oplus$, red’s actions are replaced with $a_{null}$
    • The condition can be chosen nondeterministically; the corresponding random variable is $\bigcirc$
  • $\Omega^B, \Omega^R$ are the observation sets for blue and red
  • $\mathcal{I}\in \Delta(\mathcal{S} \times \Omega^B \times \Omega^R)$ is the initial distribution over states and observations
  • $\mathcal{P}: \mathcal{S} \times \mathcal{A}^B \times \mathcal{A}^R \to \Delta(\mathcal{S} \times \Omega^B \times \Omega^R)$ is the transition kernel
  • $\mathcal{R}_{\text{Use}}:\mathcal{S} \times \mathcal{A}^B \times \mathcal{S} \to \mathbb{R}$ is the usefulness reward
  • Episodes are $H$ timesteps long
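One possible encoding of this tuple as a container type; the field names mirror the symbols above, but the concrete types (callables for the sampled distributions, frozensets for the finite sets) are an illustrative assumption:

```python
# Illustrative container for the AI-control game tuple; not from the paper.
from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable, Tuple

State = Hashable

@dataclass(frozen=True)
class AIControlGame:
    states: FrozenSet[State]                 # S
    unacceptable: FrozenSet[State]           # S_⊥ ⊆ S
    blue_actions: FrozenSet[Hashable]        # A^B
    red_actions: FrozenSet[Hashable]         # A^R, must contain "null"
    blue_obs: FrozenSet[Hashable]            # Ω^B
    red_obs: FrozenSet[Hashable]             # Ω^R
    init: Callable[[], Tuple[State, Hashable, Hashable]]   # sample from I
    step: Callable[..., Tuple[State, Hashable, Hashable]]  # sample from P
    reward_use: Callable[[State, Hashable, State], float]  # R_Use(s, a^B, s')
    horizon: int                             # H

    def __post_init__(self):
        assert self.unacceptable <= self.states
        assert "null" in self.red_actions    # stands in for a_null
```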

Policies:

$$\pi^B: \Omega^B \times \bigcup_{i=0}^{H-1}(\mathcal{A}^B \times \Omega^B)^i \to \Delta(\mathcal{A}^B)$$ $$\pi^R: \Omega^R \to \Delta(\mathcal{A}^R)$$

Sets of policies are $\Pi^B$, $\Pi^R$

Since red is memoryless, we assume that at each observation red chooses the action that minimizes safety

We then define

$$ \text{Use}(\pi^B) := \mathbb{E}\left[\sum_{t=0}^{H-1}\mathcal{R}_{\text{Use}}(S_t, A_t^B, S_{t+1})\mid \oplus, \pi^B\right] $$ $$\text{Safety}(\pi^B, \pi^R) := \mathbb{P}\left[S_H \notin \mathcal{S}_{\perp} \mid \otimes, \pi^B, \pi^R\right]$$
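A minimal Monte-Carlo sketch of these two quantities, assuming a `game` object exposing the tuple's components (all attribute names are illustrative). Use conditions on the honest game $\oplus$ by forcing red's action to $a_{null}$; Safety is the fraction of scheming episodes whose final state lies outside $\mathcal{S}_{\perp}$:

```python
# Sketch only: attribute names (init, step, reward_use, ...) are assumptions.

def rollout(game, blue_policy, red_policy, scheming):
    """One episode; returns (total usefulness reward, final state)."""
    s, ob0, o_r = game.init()
    history = []          # (blue action, next blue observation) pairs
    use = 0.0
    for _ in range(game.horizon):
        a_b = blue_policy(ob0, history)                  # blue sees history
        a_r = red_policy(o_r) if scheming else "null"    # ⊕ forces a_null
        s2, ob2, or2 = game.step(s, a_b, a_r)
        use += game.reward_use(s, a_b, s2)
        history.append((a_b, ob2))
        s, o_r = s2, or2
    return use, s

def estimate_use(game, blue_policy, n=1000):
    return sum(rollout(game, blue_policy, None, False)[0] for _ in range(n)) / n

def estimate_safety(game, blue_policy, red_policy, n=1000):
    # fraction of scheming episodes ending OUTSIDE S_⊥
    ends = (rollout(game, blue_policy, red_policy, True)[1] for _ in range(n))
    return sum(s not in game.unacceptable for s in ends) / n
```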

Control equilibrium: $(\pi^B, \pi^R)$ is a control equilibrium iff $[\min_{\pi^R}\text{Safety}(\pi^B, \pi^R), \text{Use}(\pi^B)]$ is weakly Pareto optimal.

Pareto-optimal Protocols


Lemma: A blue policy $\pi^B$ is an AI-control equilibrium iff for some $\omega \in [0, 1]$, it maximizes

$$\min_{\pi_*^R}(\omega \text{Safety}(\cdot, \pi_*^R) + (1 - \omega) \text{Use}(\cdot))$$
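A toy numeric check of the scalarization idea: with a finite menu of candidate blue policies, each summarized by its worst-case (Safety, Use) pair, sweeping $\omega$ over $[0,1]$ and maximizing the weighted sum recovers exactly the undominated candidates. Policy names and values are made up; note that scalarization only reaches the convex hull of the frontier, which is one reason the paper works with mixed policies.

```python
# Made-up (worst-case Safety, Use) summaries for hypothetical protocols.
candidates = {
    "always-defer": (1.0, 0.2),
    "never-audit":  (0.3, 1.0),
    "trusted-mon":  (0.8, 0.7),
    "dominated":    (0.5, 0.5),   # beaten by trusted-mon on both axes
}

def best_for(omega):
    """Policy maximizing ω·Safety + (1-ω)·Use."""
    return max(candidates, key=lambda k: omega * candidates[k][0]
                                       + (1 - omega) * candidates[k][1])

# Sweep ω: the set of maximizers is the (weakly) Pareto-optimal menu.
frontier = {best_for(w / 100) for w in range(101)}
```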

Lemma: For every AI-control game $\mathcal{G}$, and weight $\omega \in [0, 1]$ there is a single-objective, zero-sum POMDP $G_{\omega}$ such that

$$\forall \pi^B, \pi^R: V_{G_\omega}(\pi^B, \pi^R) = V_{\omega}^{\mathcal{G}}(\pi^B, \pi^R)$$

Theorem: A policy $\pi^B$ is in an AI-control equilibrium if and only if for some game $G_{\omega}$,

$$\pi^B \in \text{argmax}_{\Pi^B}\min_{\pi^R \in \Pi^R}V_{G_\omega}(\cdot, \pi^R)$$

(follows from the above)

Moreover, there is an algorithm for finding an $\epsilon$-equilibrium (a pair from which either player can improve by at most $\epsilon$)

Adapted Double Oracle Algorithm: Adapted from the Double Oracle Algorithm to work for memoryless $\pi^R$

Inputs: $G_{w}$, $\pi_1^B$, $\pi_1^R$, $\epsilon$

  • Initialize $t \leftarrow 0$
  • while $(t=0) \vee (\overline{v}_t - \underline{v}_t > \epsilon)$ do
    • $t \leftarrow t+1$
    • $\mathbf{p}, \mathbf{q} \leftarrow \texttt{SolveNFG}(\{V_{w}^{\mathcal{G}}(\pi_i^B, \pi_j^R)\}_{1 \leq i, j \leq t})$
    • $\sigma_t^B \leftarrow \texttt{Mix}(\mathbf{p}, \{\pi_i^B\}_{i=1}^t)$
    • $\sigma_t^R \leftarrow \texttt{Mix}(\mathbf{q}, \{\pi_i^R\}_{i=1}^t)$
    • $\pi_{t+1}^B, \overline{v}_t \leftarrow \text{argmax}_{\pi^B}V_{w}^{\mathcal{G}}(\pi^B, \sigma_t^R)$
    • $\pi_{t+1}^R, \underline{v}_t \leftarrow \text{argmin}_{\pi^R}V_{w}^{\mathcal{G}}(\sigma^B_t, \pi^R)$
  • return $\sigma_t^B, \sigma_t^R$

where

  • $\texttt{Mix}(\mathbf{p}, \{\pi_i\})$ mixes the policies, i.e., returns $$\sum_i \mathbf{p}_i \cdot \pi_i$$
  • $\texttt{SolveNFG}$ finds a Nash equilibrium of a normal-form game
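The loop above can be instantiated on a small zero-sum matrix game standing in for $G_w$: pure strategies play the role of policies, the best-response oracles simply enumerate them, and `SolveNFG` is approximated by fictitious play (a simple stand-in for an exact LP-based solver). This is a sketch under those assumptions, not the paper's implementation:

```python
def solve_nfg(M, iters=20000):
    """Approximate a Nash equilibrium of the zero-sum matrix game M
    (row player maximizes) by fictitious play."""
    n, m = len(M), len(M[0])
    row_counts, col_counts = [0] * n, [0] * m
    ri, ci = 0, 0
    for _ in range(iters):
        row_counts[ri] += 1
        col_counts[ci] += 1
        # each player best-responds to the other's empirical mixture
        ri = max(range(n), key=lambda i: sum(col_counts[j] * M[i][j] for j in range(m)))
        ci = min(range(m), key=lambda j: sum(row_counts[i] * M[i][j] for i in range(n)))
    return ([c / iters for c in row_counts], [c / iters for c in col_counts])

def double_oracle(payoff, n_rows, n_cols, eps=1e-3):
    """Double-oracle loop: solve the restricted game, add each player's
    best response, stop when the value bounds (v̄_t, v̲_t) meet."""
    rows, cols = [0], [0]                     # restricted strategy sets
    while True:
        M = [[payoff(i, j) for j in cols] for i in rows]
        p, q = solve_nfg(M)
        row_value = lambda i: sum(q[b] * payoff(i, cols[b]) for b in range(len(cols)))
        col_value = lambda j: sum(p[a] * payoff(rows[a], j) for a in range(len(rows)))
        br_row = max(range(n_rows), key=row_value)    # blue's best response to σ^R
        br_col = min(range(n_cols), key=col_value)    # red's best response to σ^B
        upper, lower = row_value(br_row), col_value(br_col)
        if upper - lower <= eps or (br_row in rows and br_col in cols):
            return rows, p, cols, q
        if br_row not in rows:
            rows.append(br_row)
        if br_col not in cols:
            cols.append(br_col)
```

Fictitious play converges slowly, so the loop also terminates once neither oracle produces a new strategy; an exact normal-form solver would let the $\overline{v}_t - \underline{v}_t \le \epsilon$ test do all the work.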

ZSPOSG Appendix


A zero-sum two-player partially observable stochastic game (ZSPOSG) is a tuple

$$G = (\mathcal{S}, \mathcal{A}^B, \mathcal{A}^R, \Omega^B, \Omega^R, \mathcal{I}, \mathcal{P}, \mathcal{R}, H)$$

Notably, the reward function here depends on both players’ actions (and, in the construction below, on the next state):

$$\mathcal{R}:\mathcal{S}\times \mathcal{A}^B\times \mathcal{A}^R \times \mathcal{S} \to \mathbb{R}$$

Given a game

$$\mathcal{G} = (\mathcal{S}, \mathcal{S}_{\perp}, \mathcal{A}^B, \mathcal{A}^R, \Omega^B, \Omega^R, \mathcal{I}, \mathcal{P}, \mathcal{R}_{\text{Use}}, H)$$

and weight $w$, we can construct $G_w$ via the following:

  • $\tilde{\mathcal{S}} := \mathcal{S} \times \{\oplus, \otimes\}$
  • $\tilde{\mathcal{P}}:\tilde{\mathcal{S}} \times \mathcal{A}^B\times \mathcal{A}^R \to \Delta(\tilde{\mathcal{S}} \times \Omega^B \times \Omega^R)$ is defined so that:
    • In $\oplus$ states, red’s action is ignored: $\tilde{\mathcal{P}}((s, \oplus), a^B, a^R)$ mimics $\mathcal{P}(s, a^B, a_{null})$
    • Terminal states loop back to themselves with probability $1$
  • $\mathcal{R}_w:\tilde{\mathcal{S}}\times \mathcal{A}^B\times \mathcal{A}^R \times \tilde{\mathcal{S}} \to \mathbb{R}$ is defined so that
    • $\mathcal{R}_w((s, \oplus), a^B, a^R, (s', \oplus)) := 2 (1- w) \cdot \mathcal{R}_{\text{Use}}(s, a^B, s')$
    • $\mathcal{R}_w((s, \otimes), a^B, a^R, (s', \otimes)) := -2 w \cdot \mathbf{1}[s' \in \mathcal{S}_{\perp} \wedge s \notin \mathcal{S}_{\perp}]$
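The two cases of this reward can be sketched directly (condition labels and names are illustrative). Since unacceptable states loop back to themselves, the scheming-side term fires exactly once, on the step that first enters $\mathcal{S}_{\perp}$, so summing it over an episode recovers an indicator of the episode ever becoming unsafe:

```python
# Sketch of the R_w construction; "honest"/"scheming" stand in for ⊕/⊗.

def make_reward(reward_use, unacceptable, w):
    """reward_use(s, a_b, s2) is R_Use; unacceptable is S_⊥; w is the weight."""
    def R(state, a_b, a_r, next_state):
        s, cond = state
        s2, _ = next_state
        if cond == "honest":                       # ⊕: scaled usefulness
            return 2 * (1 - w) * reward_use(s, a_b, s2)
        # ⊗: penalty on the step that first enters S_⊥
        return -2 * w * float(s2 in unacceptable and s not in unacceptable)
    return R
```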