DOPE Algorithm

14 Mar 2026

Preliminaries

Constrained MDPs

Constrained MDP:

$$\langle \mathcal{S}, \mathcal{A}, r, c, P, H, \overline{C}\rangle$$

with objective

$$\min_{\pi}V_r^{\pi}(P) \ \ \text{ s.t. } \ \ V_c^{\pi}(P)\leq \overline{C}$$

Assumption: The algorithm knows a safe baseline $\pi_b$ such that $V_c^{\pi}(P) < \overline{C}$

Empirical transitions

$$\hat{P}_{h, k}(s' \mid s, a) = \frac{n(s, a, s')}{\min(n(s, a), 1)}$$ $$\hat{r}_{h, k}(s, a) = \frac{1}{n_{h, k}(s, a) \vee 1} \cdot \sum_{k' = 1}^{k-1}r_h(s, a)\cdot \mathbf{1}\{s_{h, k'} = s, a_{h, k'} = a\}$$ $$\hat{c}_{h, k}(s, a) = \frac{1}{n_{h, k}(s, a) \vee 1} \cdot \sum_{k' = 1}^{k-1}c_h(s, a)\cdot \mathbf{1}\{s_{h, k'} = s, a_{h, k'} = a\}$$

Confidence sets

We construct

$$\mathcal{P}_k = \bigcap\mathcal{P}_k(s, a)$$

where

$$\mathcal{P}_k(s, a) = \{P' : |P_h'(s' \mid s, a) - \hat{P}_{h, k}(s' \mid s, a)|\leq \beta^p_{h, k}(s, a, s'), \forall h \in [H], s' \in \mathcal{S}\}$$

i.e., transitions that are within $\beta_{h, k}$ everywhere of the empirical transitions. Here we have

$$\beta_{h, k}^p(s, a, s') = \sqrt{\frac{4\text{Var}(\hat{P}_{h, k}(s' \mid s, a))L}{n_{h, k(s, a)}\vee 1}} + \frac{14L}{3(n_{h, k}(s, a) \vee 1)}$$

where

$$L = \log\left(\frac{2SAHK}{\delta}\right)$$

Theorem: the true model $P$ is an element $\mathcal{P}_k$ for any $k\in [K]$ with probability $\leq 1-2\delta$.

Similarly, we can construct

$$\mathcal{R}_k = \{r' : |r'_h(s, a) - \hat{r}_{h, k}(s, a)| \leq \beta_{h, k}^l(s, a) \ \forall s, a, h\}$$ $$\mathcal{C}_k = \{c' : |c'_h(s, a) - \hat{c}_{h, k}(s, a)| \leq \beta_{h, k}^l(s, a) \ \forall s, a, h\}$$ $$\beta^l_{h, k}(s, a) = \sqrt{\frac{L'}{n_h^k(s, a )\vee 1}}$$ $$L' = 2\log\left(\frac{6SAHK}{\delta}\right)$$

which contain the true $r$ and $c$ with probability $\geq 1-\delta$.

Let

$$\mathcal{M}_k = \mathcal{P}_k\otimes \mathcal{R}_k \otimes\mathcal{C}_k$$

Approach

In the OFU approach, we would find

$$(\pi_k, P_k) = \arg \max_{\pi', (p', r', c') \in \mathcal{M}_k}V^{\pi'}_{r'}(P')\text{ s.t. }V_{c'}^{\pi'}(P') \leq \overline{C}$$

Can be solved with linear programming, but notably might not be safe according to original $c$.

Instead, add a pessimistic penalty

$$\overline{c}_{h, k}(s, a) = \hat{c}_{h, k}(s, a)+ \beta_{h, k}^l(s, a) + H \overline{\beta}^p_{h, k}(s, a)$$

where

$$\overline{\beta}_{h, k}^p(s, a) = \sum_{s' \in \mathcal{S}}\beta^p_{h, k}(s, a, s')$$

However, such a pessimistic penalty may prevent exploration neeed to find the optimal policy. So we can modify

$$\overline{r}_{h, k}(s, a) - \frac{3H}{\overline{C} - \overline{C}_b}\beta^l_{h, k}(s, a) - \frac{H^2}{\overline{C}-\overline{C}_b}\overline{\beta}_{h, k}^p(s, a)$$

And then instead solve the problem

$$(\pi_k, P_k) = \arg\max_{\pi', P' \in \mathcal{P}_k}V_{r_k}^{\pi'}(P')\text{ s.t. }V^{\pi'}_{\overline{c}_k}(P') \leq \overline{C}$$

This is

Optimistic about the cost $c$
Optimistic about $P_k$
Pessimistic about $\overline{C}$

However, may take a while at initial steps, so we use the safe baseline instead for early steps:

Pasted image 20260314161414.png

Theoretical Results

Theorem: With a probability $\geq 1-5\delta$, all policies $\pi_k$ are safe

Theorem: The DOPE algorithm satisfies

$$R(K) \leq \tilde{\mathcal{O}}\left(\frac{SH^3}{(\overline{C} - \overline{C}_b)} \sqrt{AK}\right)$$