Provable Safe Reinforcement Learning with Binary Feedback, Bennett et al., 2022
Setting
- RL problem
- Agent cannot take unsafe actions during training
- Safety function unknown, can use human feedback to query whether an action is safe
- Feedback happens offline between episodes
- The amount of human feedback should be minimized
Setup / Notation
Rewardless MDP: $M = (\mathcal{S}, \mathcal{A}, T, \mu, H)$
- $H \in \mathbb{N}$ is the horizon
Policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$, class of possible policies $\Pi$
Suboptimality metric:
$$\texttt{SubOpt}(\hat{\pi};\Pi) := \sup_{\pi \in \Pi} V(\pi) - V(\hat{\pi})$$
Generic RL algorithm: we have access to a blackbox RL algorithm $\texttt{Alg}$ for a set of rewardless MDPs $\mathcal{M}$ with $M \in \mathcal{M}$.
Given
- $R': \mathcal{S} \times \mathcal{A} \to [0, 1]$
- $\epsilon, \delta \in (0, 1)$
- $\texttt{Alg}(R', \Pi, \epsilon, \delta)$ returns a policy $\hat{\pi} \in \Pi$ such that $\texttt{SubOpt}(\hat{\pi}; \Pi) \leq \epsilon$ with probability $\geq 1-\delta$.
- Moreover, it is guaranteed to do so within $n_{\texttt{Alg}}(\epsilon, \delta)$ episodes for some polynomial $n_{\texttt{Alg}}$ – this is the sample complexity (see the interface sketch after this list)
- (I’m assuming we need the reward space to be $[0, 1]$ in order for such bounds to be possible)
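To pin down this blackbox contract, here is a minimal interface sketch in Python; all names and types here (`BlackboxRL`, `State`, etc.) are my own illustration, not from the paper.

```python
from typing import Callable, Protocol, Sequence

# Hypothetical representations; the paper is agnostic to these choices.
State = int
Action = int
Reward = Callable[[State, Action], float]  # R': S x A -> [0, 1]
Policy = Callable[[State], Action]         # deterministic, for simplicity

class BlackboxRL(Protocol):
    """Alg(R', Pi, epsilon, delta) -> pi_hat with
    SubOpt(pi_hat; Pi) <= epsilon w.p. >= 1 - delta,
    using at most n_Alg(epsilon, delta) episodes, n_Alg polynomial."""

    def __call__(self, reward: Reward, policy_class: Sequence[Policy],
                 epsilon: float, delta: float) -> Policy: ...
```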
Safety oracle: we have access to $f^*: \mathcal{S} \times \mathcal{A} \to \{-1, +1\}$, where $f^*(s, a) = 1$ if and only if $(s, a)$ is safe
We assume $f^*$ is a member of some hypothesis class $\mathcal{F}$ with finite VC dimension, and this class is known
We use this to define a set of safe policies
$$\Pi_{\text{safe}}(f) = \{\pi_{\text{safe}}\} \cup \{\pi \in \Pi:f(s, \pi(s)) = 1 \text{ a.s.} \ \forall s \in \mathcal{S}\}$$where $\pi_{\text{safe}}$ is a known, always-safe baseline policy.
Goal: Find $\hat{\pi}$ with
$\texttt{SubOpt}(\hat{\pi}; \Pi) \leq \epsilon$ ($\hat{\pi}$ is approximately optimal)
$\hat{\pi} \in \Pi_{\text{safe}}(f^*)$ ($\hat{\pi}$ is safe)
The agent never takes unsafe actions during training
Algorithm
- For any safety-labelled dataset $\mathcal{D}$ of tuples $(s, a, f^*(s, a))$, let the version space $\mathcal{V}(\mathcal{D})$ be the subset of $\mathcal{F}$ consistent with the data.
- Moreover, for $a \in \mathcal{A}$ we define the disagreement region $RD^a(\mathcal{D})$ to be the set of states $s$ for which the hypotheses in $\mathcal{V}(\mathcal{D})$ do not all agree on the label of $(s, a)$.
- Finally, let $\Pi(\mathcal{D})$ give the set of verifiably safe policies, i.e., $$\Pi(\mathcal{D}) := \bigcap_{f \in \mathcal{V}(\mathcal{D})} \Pi_{\text{safe}}(f)$$
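For intuition, with a finite hypothesis class and finite state/action spaces these three objects can be computed by brute force. A minimal sketch, reusing the hypothetical `State`/`Action`/`Policy` aliases above and representing each $f$ as a lookup table:

```python
from typing import Dict, List, Set, Tuple

Classifier = Dict[Tuple[State, Action], int]   # f: S x A -> {-1, +1} as a table
LabeledData = List[Tuple[State, Action, int]]  # (s, a, f*(s, a)) triples

def version_space(F: List[Classifier], D: LabeledData) -> List[Classifier]:
    """V(D): hypotheses in F consistent with every label in D."""
    return [f for f in F if all(f[(s, a)] == y for s, a, y in D)]

def disagreement_region(V: List[Classifier], states: List[State],
                        a: Action) -> Set[State]:
    """RD^a(D): states where V(D) does not agree on the label of (s, a)."""
    return {s for s in states if len({f[(s, a)] for f in V}) > 1}

def verifiably_safe(V: List[Classifier], policies: List[Policy],
                    states: List[State]) -> List[Policy]:
    """Pi(D): policies that every f in V(D) certifies as safe.
    (The paper's definition additionally always includes pi_safe.)"""
    return [pi for pi in policies
            if all(f[(s, pi(s))] == 1 for f in V for s in states)]
```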
The SAfe Binary-feedback REinforcement learning (SABRE) algorithm can be defined as follows:
- It takes in a number of epochs $N$, batches per epoch $B$, rollouts per batch $m$, accuracy parameters $\epsilon_{\text{explore}}$ and $\epsilon_{R}$, and probability parameters $\delta_{\text{explore}}$ and $\delta_R$
- The algorithm is pretty brute force. Basically for each epoch
- We define a reward function $R(s, a) = \mathbf{1}\{\exists a' \in \mathcal{A}:s \in RD^{a'}(\mathcal{D})\}$
- This incentivizes visiting states that have disagreement actions, but does not incentivize taking disagreement actions, due to the requirement that our policy be verifiably safe.
- Note that the original reward $R'$ is not used here at all. We could be spending a ton of time on useless states/actions, and this seems like a major inefficiency of the algorithm.
- Then find an exploration policy $\pi$ by calling $\texttt{Alg}$ on $R$ and $\Pi(\mathcal{D})$ (with parameters $\epsilon_{\text{explore}}$, $\delta_{\text{explore}}$)
- Finally, roll out $\pi$ for $m$ episodes and query the oracle to label every state-action pair in the disagreement region that is encountered, adding the labels to $\mathcal{D}$
- After all $N$ epochs, call $\texttt{Alg}$ one final time on the original reward and $\Pi(\mathcal{D})$ (with parameters $\epsilon_R$, $\delta_R$) and return the resulting verifiably safe policy (see the sketch below)
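Putting the pieces together, here is my reading of the loop as a sketch, reusing the hypothetical helpers above; the exact epoch/batch bookkeeping and the `rollout` and `f_star` callables are assumptions, not verbatim from the paper.

```python
def sabre(Alg, F, policies, states, actions, f_star, rollout, R_true,
          N, B, m, eps_explore, delta_explore, eps_R, delta_R):
    D = []  # safety-labelled dataset of (s, a, f*(s, a)) triples
    for _epoch in range(N):
        for _batch in range(B):
            V = version_space(F, D)
            RD = {a: disagreement_region(V, states, a) for a in actions}
            # Exploration reward: 1 iff the state has *some* disagreement action.
            R_explore = lambda s, a: float(any(s in RD[ap] for ap in actions))
            # Find a verifiably safe policy that seeks out disagreement states.
            pi = Alg(R_explore, verifiably_safe(V, policies, states),
                     eps_explore, delta_explore)
            # Roll out m episodes; query the oracle on every disagreement
            # pair encountered and grow the labelled dataset.
            for _ in range(m):
                for s, _a in rollout(pi):
                    for ap in actions:
                        if s in RD[ap]:
                            D.append((s, ap, f_star(s, ap)))
    # Final call: optimize the *original* reward over verifiably safe policies.
    V = version_space(F, D)
    return Alg(R_true, verifiably_safe(V, policies, states), eps_R, delta_R)
```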

Theoretical Results
Policy-Cover Dimension: There exists some positive integer $d_{\Pi}$ such that, for any given set of policies $\Pi \subseteq \Pi_{\text{safe}}$, there exists a set of policies $\pi_1, \ldots, \pi_{d_\Pi}$ satisfying
$$ \frac{1}{H}\sum_{h=1}^H \mathbb{P}_{\pi}\left[s_h \in \tilde{\mathcal{S}}\right] \leq \sum_{i=1}^{d_\Pi} \left(\frac{1}{H}\sum_{h=1}^H \mathbb{P}_{\pi_i}\left[s_h \in \tilde{\mathcal{S}}\right]\right) $$for all measurable $\tilde{\mathcal{S}} \subseteq \mathcal{S}$ and all $\pi \in \Pi$. $d_\Pi$ is the policy-cover dimension.
For finite-state environments, this can be rewritten in terms of occupancy measures as
$$\eta^{\pi}\cdot \mathbf{1}_{s} \leq (\eta^{\pi_1} + \ldots + \eta^{\pi_{d_\Pi}}) \cdot \mathbf{1}_s$$for any $s \in \mathcal{S}$, where $\mathbf{1}_s$ is the indicator vector that is $1$ on every state-action pair whose state is $s$.
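A quick numerical sanity check of the occupancy-measure form, collapsing $\eta^\pi \cdot \mathbf{1}_s$ to a per-state occupancy vector; the numbers are made up, purely for illustration:

```python
import numpy as np

eta_pi = np.array([0.2, 0.5, 0.2, 0.1])   # eta^pi for some pi in Pi
covers = np.array([[0.6, 0.3, 0.1, 0.0],  # eta^{pi_1}
                   [0.0, 0.3, 0.2, 0.5]]) # eta^{pi_2}

# Policy-cover condition: eta^pi(s) <= sum_i eta^{pi_i}(s) for every state s.
assert np.all(eta_pi <= covers.sum(axis=0)), "pi_1, pi_2 fail to cover pi"
```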
Disagreement Coefficient: For any $\pi \in \Pi$, $h \in [H]$, $r \geq 0$ denote
$$B_{h, \pi}(r) = \{f \in \mathcal{F}: \mathbb{P}_\pi[\exists a \in \mathcal{A}: f(s_h, a) \neq f^*(s_h, a)] \leq r\}$$and the disagreement coefficient
$$\theta_{h, \pi}(r_0) = \sup_{r > r_0}\frac{\mathbb{P}_\pi[\exists f, f' \in B_{h, \pi}(r), \exists a \in \mathcal{A}: f(s_h, a) \neq f'(s_h, a)]}{r}$$Assumption: There exists some fixed $d_\theta < \infty$ such that $\theta_{h, \pi}(0) \leq d_\theta$ for all $h$ and $\pi$.
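A classic example from the active-learning literature (not from this paper): take a single action and the threshold class $f_t(s) = +1$ iff $s \geq t$ with $s_h \sim \mathrm{Unif}[0, 1]$. The ball of radius $r$ around $f^* = f_{t^*}$ contains exactly the thresholds with $|t - t^*| \leq r$, which disagree with one another precisely on $(t^* - r, t^* + r)$, so the disagreement mass is $2r$ and $\theta = 2$. A quick Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(0)
t_star, r = 0.5, 0.05
s = rng.uniform(0.0, 1.0, size=200_000)  # s_h ~ Unif[0, 1]

# Thresholds within distance r of t* disagree with each other exactly on
# the interval (t* - r, t* + r), so the disagreement mass is ~2r.
p_disagree = np.mean((s > t_star - r) & (s < t_star + r))
print(p_disagree / r)  # ~= 2, the disagreement coefficient of thresholds
```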
Theorem:
Let
- $d_{\mathsf{VC}}$ denote the VC dimension of $\mathcal{F}$
- Let $\epsilon, \delta$ be given
- Then with carefully-tuned parameters (that I don’t want to write out), we have that with probability $\geq 1-\delta$, SABRE achieves the goal above (it never takes an unsafe action during training and returns a safe, $\epsilon$-optimal policy), and moreover
- The number of training episodes is $$ \mathcal{O}\left(H^4\epsilon^{-1}\log(\delta^{-1})d_{\Pi}d_\theta^2d_{\mathsf{VC}}\right) $$ and the number of safety-label queries to the oracle is $$ \mathcal{O}\left(A\log(\epsilon^{-1})\log(\delta^{-1})H^2d_{\Pi}d_{\theta}^2d_{\mathsf{VC}}\right) $$