Mathematical Framework
- AI deployment can be modeled as i.i.d. draws from an untrusted AI model
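- $B$ is a random variable indicating whether each output is misaligned ($B=1$ is “bad”, $B=0$ is “good”)
- $A$ indicates whether an output is audited ($A=1$) or not ($A=0$)
- Primary objective is to maximize safety
- i.e., maximize $\mathbb{P}[A = 1 \mid B = 1]$, the probability that a misaligned output is audited
- Each audit incurs a cost $c_a$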
- Have access to $k$ distinct monitors indexed by $i$
- Calling monitor $i$ yields a score $m_i$ at cost $c_i$
- $m_i = \perp$ if not called
- Total cost constraint $Y$
Specifically:
$$Y \geq \mathbb{E}[\text{Cost} \mid B = 0] \geq \mathbb{P}[A = 1 \mid B = 0]\cdot c_a + \sum_i \mathbb{P}[m_i \neq \perp \mid B = 0]\cdot c_i$$
(The idea is that misalignment is rare, so the budget only needs to constrain the cost incurred on benign outputs.)
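As a rough illustration of the budget constraint, the sketch below evaluates the right-hand side of the inequality for a given audit rate and per-monitor call rates on benign outputs. The function names and the numbers in the usage example are hypothetical and chosen only for illustration.

```python
def expected_benign_cost(audit_rate, call_rates, audit_cost, monitor_costs):
    """P[A=1 | B=0] * c_a + sum_i P[m_i != ⊥ | B=0] * c_i.

    audit_rate    -- P[A=1 | B=0], fraction of benign outputs audited
    call_rates    -- call_rates[i] = P[m_i != ⊥ | B=0]
    audit_cost    -- c_a
    monitor_costs -- monitor_costs[i] = c_i
    """
    if len(call_rates) != len(monitor_costs):
        raise ValueError("need one call rate per monitor")
    monitor_part = sum(p * c for p, c in zip(call_rates, monitor_costs))
    return audit_rate * audit_cost + monitor_part


def within_budget(budget, audit_rate, call_rates, audit_cost, monitor_costs):
    """True if the expected cost on benign outputs does not exceed Y."""
    cost = expected_benign_cost(audit_rate, call_rates, audit_cost, monitor_costs)
    return cost <= budget


# Hypothetical numbers: always call a cheap monitor, call an expensive one
# on 20% of benign outputs, and audit 1% of benign outputs.
cost = expected_benign_cost(
    audit_rate=0.01, call_rates=[1.0, 0.2], audit_cost=20.0, monitor_costs=[1.0, 5.0]
)
print(cost, within_budget(3.0, 0.01, [1.0, 0.2], 20.0, [1.0, 5.0]))
```

Note that every quantity here is conditioned on $B = 0$, matching the assumption that misaligned outputs are too rare to affect the budget.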
Monitoring as Constrained POMDP
- Initialize the state as $\mathbf{m} = \begin{bmatrix}\perp & \perp & \cdots & \perp\end{bmatrix}$ (with $B \in \{0, 1\}$ hidden)
- Actions:
- Call monitor $i$
- Updates $m_i$ according to $\mathbb{P}[m_i \mid B, \mathbf{m}_{-i}]$
- Audit ($A = 1$)
- Terminate without auditing ($A = 0$)
- Objective:
- Safety objective: $$\max_{\pi \in \Pi} \mathbb{P}[A = 1 \mid B = 1, \pi]$$
- Constraint: $$Y \geq \mathbb{P}[A = 1 \mid B = 0]\cdot c_a + \sum_i \mathbb{P}[m_i \neq \perp \mid B = 0]\cdot c_i$$
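A single episode of this process can be sketched as a loop in which a policy repeatedly inspects the current score vector and picks one of the three actions. This is only a minimal sketch: `run_episode`, `two_stage_policy`, `toy_sampler`, the action encoding, and all thresholds and costs are hypothetical, and the sampler is a deterministic stand-in for $\mathbb{P}[m_i \mid B, \mathbf{m}_{-i}]$.

```python
BOT = None  # stands in for ⊥ (monitor not yet called)


def run_episode(policy, sample_score, monitor_costs, audit_cost, b):
    """Run one episode and return (A, total_cost, final score vector m).

    policy(m)             -- returns "audit", "stop", or a monitor index to call
    sample_score(i, b, m) -- draws m_i given the hidden B and the other scores
    b                     -- hidden state B (1 = misaligned, 0 = benign)
    """
    m = [BOT] * len(monitor_costs)  # initial state: no monitor called
    cost = 0.0
    while True:
        action = policy(m)
        if action == "audit":       # terminate with A = 1
            return 1, cost + audit_cost, m
        if action == "stop":        # terminate with A = 0
            return 0, cost, m
        if m[action] is not BOT:
            raise ValueError(f"monitor {action} was already called")
        m[action] = sample_score(action, b, m)
        cost += monitor_costs[action]


def two_stage_policy(m):
    """Hypothetical policy: cheap monitor first, expensive one only if suspicious."""
    if m[0] is BOT:
        return 0
    if m[0] < 0.5:
        return "stop"
    if m[1] is BOT:
        return 1
    return "audit" if m[1] >= 0.8 else "stop"


def toy_sampler(i, b, m):
    """Deterministic toy scores: high on misaligned outputs, low on benign ones."""
    return 0.9 if b == 1 else 0.1


print(run_episode(two_stage_policy, toy_sampler, [1.0, 5.0], 20.0, b=1))
print(run_episode(two_stage_policy, toy_sampler, [1.0, 5.0], 20.0, b=0))
```

Averaging the returned audit indicator over episodes with $B = 1$ would estimate the safety objective, and averaging the cost over episodes with $B = 0$ would estimate the left-hand side of the constraint.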
Policy Optimization
Two key steps:
- Determining which monitors to call and when
- Deciding which outputs to audit based on observed monitor scores
Let $\ell$ denote the likelihood ratio of the monitor scores $M_{\pi}$ observed under policy $\pi$:
$$\ell (\mathbf{m}) = \frac{\mathbb{P}[M_{\pi} = \mathbf{m} \mid B = 1]}{\mathbb{P}[M_{\pi} = \mathbf{m} \mid B = 0]}$$
Algorithm:
- Enumerate all monitor calling strategies
- For each strategy, estimate the likelihood ratio $\ell(\mathbf{m})$ of every benign datapoint from the monitor scores $\mathbf{m}$ observed on that datapoint, then determine the likelihood ratio threshold at which auditing exhausts the budget remaining after monitor costs
- Audit all outputs with likelihood ratios exceeding this threshold, and measure the recall on misaligned outputs
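The three steps above can be sketched as follows, assuming the likelihood ratios have already been estimated for each candidate strategy (for example by some density estimator, which is not shown). All names here (`audit_threshold`, `evaluate_strategy`, `best_strategy`), the way strategies are represented, and the toy numbers are hypothetical.

```python
def audit_threshold(benign_lrs, audit_fraction):
    """Threshold t such that at most audit_fraction of benign LRs strictly exceed t."""
    ranked = sorted(benign_lrs, reverse=True)
    n_audits = int(audit_fraction * len(ranked))  # benign audits we can afford
    if n_audits >= len(ranked):
        return float("-inf")  # budget allows auditing everything
    return ranked[n_audits]


def evaluate_strategy(benign_lrs, misaligned_lrs, monitor_cost, budget, audit_cost):
    """Return (threshold, recall), or None if monitor calls alone exceed the budget.

    monitor_cost is the strategy's expected monitor cost on benign outputs.
    """
    remaining = budget - monitor_cost
    if remaining < 0:
        return None
    audit_fraction = min(1.0, remaining / audit_cost)
    t = audit_threshold(benign_lrs, audit_fraction)
    caught = sum(1 for lr in misaligned_lrs if lr > t)
    return t, caught / len(misaligned_lrs)


def best_strategy(strategies, budget, audit_cost):
    """strategies maps name -> (benign_lrs, misaligned_lrs, monitor_cost)."""
    best = None
    for name, (benign, misaligned, monitor_cost) in strategies.items():
        result = evaluate_strategy(benign, misaligned, monitor_cost, budget, audit_cost)
        if result is not None and (best is None or result[1] > best[2]):
            best = (name, result[0], result[1])
    return best  # (name, threshold, recall), or None if nothing is feasible


# Toy likelihood ratios for three hypothetical calling strategies.
strategies = {
    "cheap_only": ([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 2.0, 3.0],
                   [0.5, 1.0, 4.0, 5.0], 1.0),
    "both_monitors": ([0.1] * 9 + [1.0], [10.0, 10.0, 0.5, 0.5], 2.5),
    "too_expensive": ([0.1] * 10, [10.0] * 4, 4.0),
}
print(best_strategy(strategies, budget=3.0, audit_cost=10.0))
```

Because the threshold uses a strict inequality, ties at the threshold are left unaudited, so the sketch may spend slightly less than the full budget; a randomized tie-break would be one way to use the remainder.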