RL Paper Summaries

Link: Mastering Atari with Discrete World Models, Hafner et al., 2021

Overview:

The paper introduces DreamerV2, a state-of-the-art model-based RL agent: it learns a model of its environment's dynamics and trains its policy inside that model, instead of learning only from observed states / rewards.

Recurrent State-Space Model Architecture:

The RSSM is the world model used by the agent. It models the different parts of a partially observable Markov decision process (POMDP).

Let

  • $x_t$ represents the observation at time $t$
  • $z_t$ represents the inferred state of the environment
  • $h_t$ roughly represents memory of previous states / observations, and is the hidden state of an RNN

Then the model components are:

  • Recurrent model: $h_t = f_{\theta}(h_{t-1}, z_{t-1}, a_{t-1})$
  • Representation model: $z_t \sim q_{\theta}(z_{t} \mid h_t, x_t)$
  • Transition predictor: $\hat{z}_t \sim p_{\theta}(\hat{z}_t \mid h_t)$
  • Observation predictor: $\hat{x}_t\sim p_{\theta}(\hat{x}_t \mid h_t, z_t)$
  • Reward predictor: $\hat{r}_t \sim p_{\theta}(\hat{r}_t \mid h_t, z_t)$
  • Discount predictor: $\hat{\gamma}_t \sim p_{\theta}(\hat{\gamma}_t \mid h_t, z_t)$

Here, each function / sampling distribution is modeled with a neural network.
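To make the data flow between these components concrete, here is a minimal sketch of a single RSSM step in plain Python. Everything in it is a stand-in: `recurrent_model`, `prior_logits`, `posterior_logits`, and `rssm_step` are hypothetical names, the "networks" are fixed toy functions rather than learned ones, and $z_t$ is reduced to a single categorical variable for brevity. It only illustrates which inputs each component sees.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_categorical(probs, rng):
    # Draw one class index from the categorical distribution `probs`.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def recurrent_model(h_prev, z_prev, a_prev):
    # Toy stand-in for h_t = f(h_{t-1}, z_{t-1}, a_{t-1}); a real model would use a learned RNN.
    return [math.tanh(0.5 * h + 0.1 * z_prev + 0.1 * a_prev) for h in h_prev]

def prior_logits(h, num_classes):
    # Toy stand-in for the transition predictor p(z_hat_t | h_t): sees only h_t.
    return [sum(h) * (k + 1) / num_classes for k in range(num_classes)]

def posterior_logits(h, x, num_classes):
    # Toy stand-in for the representation model q(z_t | h_t, x_t): also sees x_t.
    return [p + x * k for k, p in enumerate(prior_logits(h, num_classes))]

def rssm_step(h_prev, z_prev, a_prev, x=None, num_classes=4, rng=random):
    # Advance the memory first, then sample the stochastic state.
    h = recurrent_model(h_prev, z_prev, a_prev)
    if x is None:
        logits = prior_logits(h, num_classes)         # no observation: use the prior
    else:
        logits = posterior_logits(h, x, num_classes)  # observation available: use the posterior
    z = sample_categorical(softmax(logits), rng)
    return h, z

# Usage: two steps with observations, then one step without.
rng = random.Random(0)
h, z = [0.0, 0.0], 0
for action, obs in [(1, 0.3), (0, -0.2)]:
    h, z = rssm_step(h, z, action, x=obs, rng=rng)
h, z_hat = rssm_step(h, z, 1, rng=rng)
```

Calling `rssm_step` without `x` is the case where the model has to predict the next state from memory alone, which is the situation the transition predictor exists for.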

To train each component, we look at:

  • How unlikely each observation was according to the observation predictor
  • How unlikely each reward was according to the reward predictor
  • How unlikely each discount was according to the discount predictor (the target is a constant $\gamma$ within an episode and zero at terminal steps, so this predictor effectively learns when episodes end)
  • How close our prior distribution (the transition predictor) is to the posterior distribution obtained by observing $x_t$ and then sampling $z_t$ from $q_{\theta}$

This is all captured by the loss function:

[Figure: DreamerV2 loss diagram]
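Written out, the four terms listed above combine into a single objective. The following is a sketch in the notation of these notes, not a transcription of the diagram. The sequence length $T$ and the KL weight $\beta$ are symbols introduced here, and any further refinements the paper applies to the KL term are omitted.

$$\mathcal{L}(\theta) = \mathbb{E}_{q_{\theta}}\left[\sum_{t=1}^{T} -\ln p_{\theta}(x_t \mid h_t, z_t) - \ln p_{\theta}(r_t \mid h_t, z_t) - \ln p_{\theta}(\gamma_t \mid h_t, z_t) + \beta \, \mathrm{KL}\left[q_{\theta}(z_t \mid h_t, x_t) \,\|\, p_{\theta}(z_t \mid h_t)\right]\right]$$

The first three terms are the negative log-likelihoods of the observed image, reward, and discount. The last term pulls the prior (transition predictor) and the posterior (representation model) toward each other.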

Imagination MDP:

By learning a model for the environment, our agent can imagine future trajectories and calculate their expected rewards to get an approximation for the value / Q functions. More specifically, note:

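  • We can keep sampling $\hat{z}_t$ according to the transition predictor $p_{\theta}(\hat{z}_t \mid \hat{z}_{t-1}, \hat{a}_{t-1})$, where $\hat{z}_t$ now denotes the full imagined model state (the memory $h_t$ together with the sampled state), so no real observations are needed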
  • We can sample rewards from the reward model $p_{\theta}(\hat{r}_t \mid \hat{z}_t)$
  • We can sample discounts from $p_{\theta}(\hat{\gamma}_t \mid \hat{z}_t)$

This can be used to train a critic in an actor-critic method.
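As a rough illustration of how imagined rewards and discounts turn into a value estimate, the sketch below computes the discounted return of an imagined trajectory and averages it over several trajectories. The function names and the optional `bootstrap_value` argument (a critic estimate at the end of the imagined horizon) are assumptions made for this example. It is a plain Monte Carlo style target and may not match the exact target used in the paper.

```python
def imagined_return(rewards, discounts, bootstrap_value=0.0):
    """Discounted return of one imagined trajectory.

    rewards[k] and discounts[k] are the predicted reward and discount at
    imagined step k. Uses the recursion G_k = r_k + gamma_k * G_{k+1},
    starting from bootstrap_value after the final step.
    """
    ret = bootstrap_value
    for reward, discount in zip(reversed(rewards), reversed(discounts)):
        ret = reward + discount * ret
    return ret

def monte_carlo_value(trajectories):
    # Average return over several imagined (rewards, discounts) trajectories.
    returns = [imagined_return(r, g) for r, g in trajectories]
    return sum(returns) / len(returns)

# With a constant discount this reduces to the usual sum of gamma^k * r_k:
# 1 + 0.9 * 1 + 0.81 * 1 = 2.71
print(imagined_return([1.0, 1.0, 1.0], [0.9, 0.9, 0.9]))

# A predicted discount of 0 cuts off everything after that step.
print(imagined_return([1.0, 5.0], [0.0, 0.9]))
```

Letting the discount vary per step is what allows the imagined return to account for predicted episode ends.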

Note that since the pair $(h_t, z_t)$ is Markovian, we have a full MDP instead of a POMDP! This is the power of having a world model: we don't force the policy to keep track of long-horizon behavior, but instead encode that into the latent space of the world model.

Actor-Critic Training Regime:

To train the agent, we train:

  • An actor $$\hat{a}_t \sim \pi_{\theta}(\hat{a}_t \mid \hat{z}_t)$$
  • A critic $$v_{\theta}(\hat{z}_t) \approx \mathbb{E}_{p_{\theta}, \pi_{\theta}}\left[\sum_{\tau \geq t}\hat{\gamma}^{\tau-t}\hat{r}_{\tau}\right]$$

The full training loop:

  • Sample a bunch of trajectories using real observations / rewards with policy $\pi_{\theta}$ (or some modified version of it)
  • Train the RSSM model against the real data
  • Freeze the world model
  • Using the Imagination MDP, train the critic to better approximate state values
  • Using the Imagination MDP and the critic, train the actor to maximize reward
  • Repeat
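The steps above can be sketched as a loop skeleton. All five callables (`collect`, `fit_world_model`, `imagine`, `fit_critic`, `fit_actor`) are hypothetical placeholders supplied by the caller, and the replay list is an assumption about how real data is stored. The sketch only fixes the order of the phases and makes explicit that the world model is updated in one place and merely read afterwards.

```python
def train_agent(collect, fit_world_model, imagine, fit_critic, fit_actor, iterations):
    """Run the alternating training phases for a fixed number of iterations."""
    replay = []  # all real trajectories collected so far
    for _ in range(iterations):
        # 1. Interact with the real environment using the current policy.
        replay.extend(collect())
        # 2. Train the RSSM on real data. This is the only world-model update.
        fit_world_model(replay)
        # 3. World model is treated as frozen from here: generate imagined rollouts.
        rollouts = imagine(replay)
        # 4. Train the critic on imagined rollouts.
        fit_critic(rollouts)
        # 5. Train the actor on the same rollouts, guided by the critic.
        fit_actor(rollouts)
    return replay
```

Passing the phases in as functions keeps the skeleton independent of any particular network library.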