Combining Cost-Constrained Runtime Monitors for AI Safety

26 Apr 2026

AI deployment can be modeled as i.i.d. draws from an untrusted AI model
$B$ is a random variable indicating whether each output is misaligned ($B=1$ is “bad”, $B=0$ is “good”)
Primary objective is to maximize safety
- i.e., maximize $\mathbb{P}[A = 1 \mid B = 1]$, where $A = 1$ means the output is audited
- Each audit incurs a cost $c_a$
Have access to $k$ distinct monitors indexed by $i$
Calling monitor $i$ yields a score $m_i$, cost $c_i$
- $m_i = \perp$ if not called
Total cost constraint $Y$

Specifically:

$$Y \geq \mathbb{E}[\text{Cost} \mid B = 0] = \mathbb{P}[A = 1 \mid B = 0]\cdot c_a + \sum_i \mathbb{P}[m_i \neq \perp \mid B = 0]\cdot c_i$$

(The idea is that misalignment is rare, so spending on misaligned outputs is negligible and the budget is enforced on benign outputs only.)
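As a quick numerical check of the budget expression, here is a minimal sketch that evaluates the expected cost on benign outputs. The function name `expected_benign_cost` and all numbers are hypothetical, chosen only for illustration.

```python
def expected_benign_cost(audit_rate, call_rates, audit_cost, call_costs):
    """E[Cost | B = 0] = P[A=1 | B=0] * c_a + sum_i P[m_i called | B=0] * c_i."""
    if len(call_rates) != len(call_costs):
        raise ValueError("need one call rate per monitor cost")
    monitor_cost = sum(p * c for p, c in zip(call_rates, call_costs))
    return audit_rate * audit_cost + monitor_cost


# Hypothetical numbers: monitor 0 is always called, monitor 1 on 20% of
# benign outputs, and 1% of benign outputs are audited.
cost = expected_benign_cost(
    audit_rate=0.01,
    call_rates=[1.0, 0.2],
    audit_cost=20.0,
    call_costs=[1.0, 5.0],
)
budget = 2.5  # hypothetical Y
within_budget = cost <= budget
```

With these made-up values the three terms are 0.2, 1.0 and 1.0, so the policy would fit inside a budget of 2.5.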

Initialize $\mathbf{m} = \begin{bmatrix}\perp & \perp & \cdots & \perp\end{bmatrix}$ as initial state (and $B \in \{0, 1\}$ as hidden)
Actions:
- Call monitor $i$
  - Updates $m_i$ according to $\mathbb{P}[m_i \mid B, \mathbf{m}_{-i}]$
- Audit ($A = 1$)
- Terminate without auditing ($A = 0$)
Objective:
- Safety Objective: $$\max_{\pi \in \Pi} \mathbb{P}[A = 1 \mid B = 1, \pi]$$
- Constraint: $$Y \geq \mathbb{P}[A = 1 \mid B = 0]\cdot c_a + \sum_i \mathbb{P}[m_i \neq \perp \mid B = 0]\cdot c_i$$
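To make the state and actions concrete, here is a minimal sketch of one episode under a single illustrative policy with two monitors. The policy shape (always call monitor 0, escalate to monitor 1 above a cutoff, audit on the second score) and the names `run_episode`, `escalate_at` and `audit_at` are assumptions for this example, not a policy from the source. `None` stands in for $\perp$.

```python
from typing import Callable, List, Optional, Tuple

State = List[Optional[float]]  # m_i stays None (i.e. ⊥) until monitor i is called


def run_episode(
    monitors: List[Callable[[], float]],
    escalate_at: float,
    audit_at: float,
) -> Tuple[State, int]:
    """Run one output through a two-monitor policy; return (m, A)."""
    m: State = [None] * len(monitors)
    m[0] = monitors[0]()           # action: call monitor 0
    if m[0] < escalate_at:
        return m, 0                # action: terminate without auditing (A = 0)
    m[1] = monitors[1]()           # action: call monitor 1
    return m, int(m[1] >= audit_at)  # action: audit (A = 1) or terminate


# Deterministic stand-in monitors, for illustration only.
low_state, low_audit = run_episode([lambda: 0.2, lambda: 0.9], 0.5, 0.7)
high_state, high_audit = run_episode([lambda: 0.6, lambda: 0.9], 0.5, 0.7)
```

Averaging the audit decisions and the non-`None` entries over benign outputs gives empirical estimates of the probabilities that appear in the constraint.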

Two key steps:

Let $\ell$ denote the likelihood ratio

$$\ell (\mathbf{m}) = \frac{\mathbb{P}[M_{\pi} = \mathbf{m} \mid B = 1]}{\mathbb{P}[M_{\pi} = \mathbf{m} \mid B = 0]}$$
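As an illustration of the likelihood ratio for a single monitor, here is a minimal sketch that assumes the score is Gaussian under each value of $B$. The Gaussian form and the parameter values are assumptions made for this example; in practice the two densities would have to be estimated from data.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def likelihood_ratio(score, bad=(2.0, 1.0), good=(0.0, 1.0)):
    """l(m) = p(m | B = 1) / p(m | B = 0) for one monitor score.

    `bad` and `good` are hypothetical (mean, std) pairs for the score
    distribution on misaligned and benign outputs respectively.
    """
    return gaussian_pdf(score, *bad) / gaussian_pdf(score, *good)


lr_midpoint = likelihood_ratio(1.0)  # equidistant from both means
lr_high = likelihood_ratio(2.0)      # looks more like a misaligned output
lr_low = likelihood_ratio(0.0)       # looks more like a benign output
```

Under these assumed parameters the ratio is 1 at the midpoint between the two means and grows with the score, so thresholding on $\ell$ matches thresholding on the raw score; with several monitors the two generally differ.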

Algorithm:

Enumerate all monitor calling strategies
For each strategy, estimate likelihood ratios for all benign datapoints, based on information from $\mathbf{m}$ on that datapoint. Determine the likelihood ratio threshold that exhausts the remaining budget
Audit all outputs with likelihood ratios exceeding this threshold, and measure the recall on misaligned outputs
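The three steps above can be sketched as follows, under a simplifying assumption: each strategy always calls a fixed set of monitors, so its monitor cost is a constant. The likelihood ratios are taken as already estimated, and all names and numbers (`audit_threshold`, `evaluate_strategy`, the `strategies` table) are hypothetical.

```python
def audit_threshold(benign_lrs, audit_fpr):
    """Threshold t such that at most a fraction audit_fpr of benign LRs exceed t."""
    ranked = sorted(benign_lrs, reverse=True)
    n_audits = int(audit_fpr * len(ranked))  # benign audits we can afford
    if n_audits >= len(ranked):
        return float("-inf")                 # budget allows auditing everything
    return ranked[n_audits]                  # audit strictly above this value


def evaluate_strategy(benign_lrs, bad_lrs, budget, monitor_cost, audit_cost):
    """Return (threshold, recall on misaligned outputs), or None if unaffordable."""
    remaining = budget - monitor_cost
    if remaining < 0:
        return None
    audit_fpr = min(1.0, remaining / audit_cost)
    t = audit_threshold(benign_lrs, audit_fpr)
    recall = sum(lr > t for lr in bad_lrs) / len(bad_lrs)
    return t, recall


# Made-up likelihood ratios for two strategies.
strategies = {
    "cheap": {
        "cost": 1.0,
        "benign": [0.1, 0.2, 0.3, 0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0],
        "bad": [0.5, 1.8, 2.5, 4.0],
    },
    "expensive": {
        "cost": 3.0,
        "benign": [0.05, 0.1, 0.1, 0.2, 0.3, 0.4, 0.6, 0.9, 1.5, 2.5],
        "bad": [2.0, 3.0, 6.0, 9.0],
    },
}

results = {
    name: evaluate_strategy(s["benign"], s["bad"], budget=5.0,
                            monitor_cost=s["cost"], audit_cost=20.0)
    for name, s in strategies.items()
}
affordable = {k: v for k, v in results.items() if v is not None}
best_name = max(affordable, key=lambda k: affordable[k][1])
```

In this toy data the expensive strategy leaves less budget for audits but separates the classes better, so it ends up with higher recall. Strategies that call monitors conditionally would need the expected monitor cost on benign outputs in place of the fixed `cost`.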