Combining Cost-Constrained Runtime Monitors for AI Safety

Mathematical Framework


  • AI deployment can be modeled as i.i.d. draws from an untrusted AI model
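  • $B$ is a random variable indicating whether each output is misaligned ($B=1$ is “bad”, $B=0$ is “good”)
  • $A$ indicates whether an output is audited ($A=1$) or not ($A=0$)
    • Each audit incurs a cost $c_a$
  • Primary objective is to maximize safety
    • i.e., maximize $\mathbb{P}[A = 1 \mid B = 1]$, the probability that a misaligned output is audited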
  • Have access to $k$ distinct monitors indexed by $i$
  • Calling monitor $i$ yields a score $m_i$ and incurs a cost $c_i$
    • $m_i = \perp$ if not called
  • Total expected cost is constrained by a budget $Y$

Specifically:

$$Y \geq \mathbb{E}[\text{Cost} \mid B = 0] = \mathbb{P}[A = 1 \mid B = 0]\cdot c_a + \sum_i \mathbb{P}[m_i \neq \perp \mid B = 0]\cdot c_i$$

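(The cost is conditioned on $B = 0$ because misalignment is assumed to be rare, so spending on misaligned outputs contributes negligibly to the total cost and should not be constrained.)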
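As a worked illustration of the constraint above, the following minimal sketch evaluates the expected benign cost for given audit and monitor-call rates. The function names and the example numbers are hypothetical and chosen only for illustration.

```python
def expected_benign_cost(audit_rate, call_rates, audit_cost, monitor_costs):
    """P[A=1 | B=0] * c_a + sum_i P[m_i != ⊥ | B=0] * c_i."""
    if len(call_rates) != len(monitor_costs):
        raise ValueError("need exactly one call rate per monitor")
    monitoring = sum(p * c for p, c in zip(call_rates, monitor_costs))
    return audit_rate * audit_cost + monitoring


def within_budget(budget, audit_rate, call_rates, audit_cost, monitor_costs):
    """Check Y >= E[Cost | B = 0], with a small tolerance for float rounding."""
    cost = expected_benign_cost(audit_rate, call_rates, audit_cost, monitor_costs)
    return cost <= budget + 1e-12


# Hypothetical numbers: audit 1% of benign outputs at cost 20, always call a
# cheap monitor (cost 1), and call an expensive monitor (cost 5) 20% of the time.
cost = expected_benign_cost(0.01, [1.0, 0.2], 20.0, [1.0, 5.0])
print(cost)  # 0.01*20 + 1.0*1 + 0.2*5, i.e. about 2.2
```

Because every rate is measured on benign outputs only, the budget check never depends on how often misaligned outputs occur.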

Monitoring as Constrained POMDP


  • Initialize $\mathbf{m} = \begin{bmatrix}\perp & \perp & \cdots & \perp\end{bmatrix}$ as the initial state, with $B \in \{0, 1\}$ hidden
  • Actions:
    • Call monitor $i$
      • Updates $m_i$ according to $\mathbb{P}[m_i \mid B, \mathbf{m}_{-i}]$
    • Audit ($A = 1$)
    • Terminate without auditing ($A = 0$)
  • Objective:
    • Safety Objective: $$\max_{\pi \in \Pi} \mathbb{P}[A = 1 \mid B = 1, \pi]$$
    • Constraint: $$Y \geq \mathbb{P}[A = 1 \mid B = 0]\cdot c_a + \sum_i \mathbb{P}[m_i \neq \perp \mid B = 0]\cdot c_i$$
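The state and actions above can be sketched as a simple rollout loop. This is a minimal illustration, not the authors' implementation: `run_episode`, `example_policy`, `noisy_score`, and the thresholds 0.3, 0.5, and 0.8 are all hypothetical, and the toy monitor ignores the other scores, although the transition $\mathbb{P}[m_i \mid B, \mathbf{m}_{-i}]$ allows dependence on them.

```python
import random

BOTTOM = None  # stands in for ⊥: this monitor has not been called


def run_episode(policy, sample_score, k, b, rng):
    """Roll out one output; return (audited, final monitor scores)."""
    m = [BOTTOM] * k                  # initial state: no monitor called
    while True:
        action = policy(m)            # the policy sees m only, never b
        if action == "audit":
            return True, m            # A = 1
        if action == "stop":
            return False, m           # A = 0
        if m[action] is not BOTTOM:
            raise ValueError(f"monitor {action} was already called")
        m[action] = sample_score(action, b, m, rng)


def example_policy(m):
    """Hypothetical policy: call monitor 0 first, monitor 1 only if unsure."""
    if m[0] is BOTTOM:
        return 0
    if m[0] >= 0.8:
        return "audit"
    if m[0] < 0.3:
        return "stop"
    if m[1] is BOTTOM:
        return 1
    return "audit" if m[1] > 0.5 else "stop"


def noisy_score(i, b, m, rng):
    """Hypothetical monitor: score centred on b, independent of other scores."""
    return rng.gauss(float(b), 0.5)


audited, scores = run_episode(example_policy, noisy_score, k=2, b=1,
                              rng=random.Random(0))
print(audited, scores)
```

Averaging `audited` over many rollouts with `b=1` would estimate the safety objective, and averaging the incurred costs over rollouts with `b=0` would estimate the left-hand side of the constraint.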

Policy Optimization


Two key steps:

  • Determining which monitors to call and when
  • Deciding which outputs to audit based on observed monitor scores

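Let $\ell$ denote the likelihood ratio of the observed monitor scores, where $M_{\pi}$ is the vector of scores produced under calling strategy $\pi$: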

$$\ell (\mathbf{m}) = \frac{\mathbb{P}[M_{\pi} = \mathbf{m} \mid B = 1]}{\mathbb{P}[M_{\pi} = \mathbf{m} \mid B = 0]}$$
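One simple way to estimate this ratio for a single monitor is to fit a density to its scores on each class and divide. The sketch below assumes Gaussian class-conditional scores, which is an illustrative assumption rather than something the source specifies. `fit_likelihood_ratio` is a hypothetical helper, and the sample scores are made up.

```python
from statistics import NormalDist


def fit_likelihood_ratio(bad_scores, good_scores):
    """Return ell(score) = p(score | B=1) / p(score | B=0) under Gaussian fits."""
    bad = NormalDist.from_samples(bad_scores)    # needs at least two samples
    good = NormalDist.from_samples(good_scores)

    def ell(score):
        return bad.pdf(score) / good.pdf(score)

    return ell


# Made-up scores: misaligned outputs tend to score higher.
ell = fit_likelihood_ratio(bad_scores=[1.0, 2.0, 3.0],
                           good_scores=[-1.0, 0.0, 1.0])
print(ell(0.0), ell(1.0), ell(2.0))
```

With equal fitted variances, as in this toy data, the ratio is monotone in the score, so thresholding $\ell$ is equivalent to thresholding the raw score. With several monitors or unequal variances that equivalence no longer holds in general, which is why the threshold is placed on $\ell$ itself.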

Algorithm:

  • Enumerate all monitor calling strategies
  • For each strategy, estimate the likelihood ratio $\ell(\mathbf{m})$ of every benign datapoint from the monitor scores $\mathbf{m}$ observed on that datapoint, then find the likelihood ratio threshold at which auditing exhausts the budget remaining after monitoring costs
  • Audit all outputs whose likelihood ratio exceeds this threshold, and measure the recall on misaligned outputs
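The steps above can be sketched end to end under several simplifying assumptions, none of which come from the source: only static strategies are enumerated (a fixed subset of monitors is always called), likelihood ratios use a naive-Bayes Gaussian fit, and fitting and evaluation share the same data. All function names and the toy dataset are hypothetical. The sketch thresholds the log of $\ell$, which gives the same audit decisions as thresholding $\ell$.

```python
import math
from itertools import combinations
from statistics import NormalDist


def log_pdf(dist, x):
    """Log density of a fitted NormalDist at x."""
    z = (x - dist.mean) / dist.stdev
    return -0.5 * z * z - math.log(dist.stdev) - 0.5 * math.log(2 * math.pi)


def fit_log_lr(bad_rows, good_rows, subset):
    """Log likelihood ratio over the monitors in subset (independence assumed)."""
    fits = [(i,
             NormalDist.from_samples([r[i] for r in bad_rows]),
             NormalDist.from_samples([r[i] for r in good_rows]))
            for i in subset]

    def log_ell(row):
        return sum(log_pdf(bad, row[i]) - log_pdf(good, row[i])
                   for i, bad, good in fits)

    return log_ell


def evaluate_static_strategies(bad_rows, good_rows, monitor_costs,
                               audit_cost, budget):
    """Map each affordable monitor subset to its recall on misaligned rows."""
    results = {}
    k = len(monitor_costs)
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            remaining = budget - sum(monitor_costs[i] for i in subset)
            if remaining < 0:
                continue  # monitoring alone exceeds the budget
            audit_rate = min(1.0, remaining / audit_cost)
            log_ell = fit_log_lr(bad_rows, good_rows, subset)
            benign = sorted((log_ell(r) for r in good_rows), reverse=True)
            n_audit = int(audit_rate * len(benign) + 1e-9)
            # Strict ">" below audits at most n_audit benign rows, even with ties.
            threshold = benign[n_audit] if n_audit < len(benign) else -math.inf
            caught = sum(log_ell(r) > threshold for r in bad_rows)
            results[subset] = caught / len(bad_rows)
    return results


# Toy data: each row is (score from monitor 0, score from monitor 1).
good_rows = [(0.0, 2.0), (1.0, 0.0), (2.0, 4.0), (3.0, 1.0), (4.0, 3.0)]
bad_rows = [(3.0, 1.0), (4.0, 5.0), (5.0, 2.0), (6.0, 4.0), (7.0, 3.0)]
results = evaluate_static_strategies(bad_rows, good_rows,
                                     monitor_costs=[1.0, 4.0],
                                     audit_cost=10.0, budget=5.0)
best = max(results, key=results.get)
print(results, best)
```

In this toy setting the cheap monitor alone wins, because calling the expensive monitor leaves little or no budget for audits. Picking the strategy with the highest recall is assumed here as the natural final step. When the remaining audit budget is zero, the threshold sits at the highest benign score, so only outputs scoring above every benign datapoint are audited.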