Preliminaries
Constrained MDPs
Constrained MDP:
$$\langle \mathcal{S}, \mathcal{A}, r, c, P, H, \overline{C}\rangle$$with objective
$$\min_{\pi}V_r^{\pi}(P) \ \ \text{ s.t. } \ \ V_c^{\pi}(P)\leq \overline{C}$$Assumption: The algorithm knows a safe baseline $\pi_b$ such that $V_c^{\pi}(P) < \overline{C}$
Empirical transitions
$$\hat{P}_{h, k}(s' \mid s, a) = \frac{n(s, a, s')}{\min(n(s, a), 1)}$$ $$\hat{r}_{h, k}(s, a) = \frac{1}{n_{h, k}(s, a) \vee 1} \cdot \sum_{k' = 1}^{k-1}r_h(s, a)\cdot \mathbf{1}\{s_{h, k'} = s, a_{h, k'} = a\}$$ $$\hat{c}_{h, k}(s, a) = \frac{1}{n_{h, k}(s, a) \vee 1} \cdot \sum_{k' = 1}^{k-1}c_h(s, a)\cdot \mathbf{1}\{s_{h, k'} = s, a_{h, k'} = a\}$$Confidence sets
We construct
$$\mathcal{P}_k = \bigcap\mathcal{P}_k(s, a)$$where
$$\mathcal{P}_k(s, a) = \{P' : |P_h'(s' \mid s, a) - \hat{P}_{h, k}(s' \mid s, a)|\leq \beta^p_{h, k}(s, a, s'), \forall h \in [H], s' \in \mathcal{S}\}$$i.e., transitions that are within $\beta_{h, k}$ everywhere of the empirical transitions. Here we have
$$\beta_{h, k}^p(s, a, s') = \sqrt{\frac{4\text{Var}(\hat{P}_{h, k}(s' \mid s, a))L}{n_{h, k(s, a)}\vee 1}} + \frac{14L}{3(n_{h, k}(s, a) \vee 1)}$$where
$$L = \log\left(\frac{2SAHK}{\delta}\right)$$Theorem: the true model $P$ is an element $\mathcal{P}_k$ for any $k\in [K]$ with probability $\leq 1-2\delta$.
Similarly, we can construct
$$\mathcal{R}_k = \{r' : |r'_h(s, a) - \hat{r}_{h, k}(s, a)| \leq \beta_{h, k}^l(s, a) \ \forall s, a, h\}$$ $$\mathcal{C}_k = \{c' : |c'_h(s, a) - \hat{c}_{h, k}(s, a)| \leq \beta_{h, k}^l(s, a) \ \forall s, a, h\}$$ $$\beta^l_{h, k}(s, a) = \sqrt{\frac{L'}{n_h^k(s, a )\vee 1}}$$ $$L' = 2\log\left(\frac{6SAHK}{\delta}\right)$$which contain the true $r$ and $c$ with probability $\geq 1-\delta$.
Let
$$\mathcal{M}_k = \mathcal{P}_k\otimes \mathcal{R}_k \otimes\mathcal{C}_k$$Approach
In the OFU approach, we would find
$$(\pi_k, P_k) = \arg \max_{\pi', (p', r', c') \in \mathcal{M}_k}V^{\pi'}_{r'}(P')\text{ s.t. }V_{c'}^{\pi'}(P') \leq \overline{C}$$Can be solved with linear programming, but notably might not be safe according to original $c$.
Instead, add a pessimistic penalty
$$\overline{c}_{h, k}(s, a) = \hat{c}_{h, k}(s, a)+ \beta_{h, k}^l(s, a) + H \overline{\beta}^p_{h, k}(s, a)$$where
$$\overline{\beta}_{h, k}^p(s, a) = \sum_{s' \in \mathcal{S}}\beta^p_{h, k}(s, a, s')$$However, such a pessimistic penalty may prevent exploration neeed to find the optimal policy. So we can modify
$$\overline{r}_{h, k}(s, a) - \frac{3H}{\overline{C} - \overline{C}_b}\beta^l_{h, k}(s, a) - \frac{H^2}{\overline{C}-\overline{C}_b}\overline{\beta}_{h, k}^p(s, a)$$And then instead solve the problem
$$(\pi_k, P_k) = \arg\max_{\pi', P' \in \mathcal{P}_k}V_{r_k}^{\pi'}(P')\text{ s.t. }V^{\pi'}_{\overline{c}_k}(P') \leq \overline{C}$$This is
- Optimistic about the cost $c$
- Optimistic about $P_k$
- Pessimistic about $\overline{C}$
However, may take a while at initial steps, so we use the safe baseline instead for early steps:

Theoretical Results
Theorem: With a probability $\geq 1-5\delta$, all policies $\pi_k$ are safe
Theorem: The DOPE algorithm satisfies
$$R(K) \leq \tilde{\mathcal{O}}\left(\frac{SH^3}{(\overline{C} - \overline{C}_b)} \sqrt{AK}\right)$$