Decision Transformer Paper Summary

Decision Transformer: Reinforcement Learning via Sequence Modeling, Chen et al., 2021

Overview

Decision Transformers model the trajectories a policy generates in a given MDP as sequences. Thus, they are (at least in this paper) used for imitation learning and offline RL rather than to train an agent via traditional online methods.

Preliminaries

Offline RL

In offline RL, you are given only a fixed dataset of trajectories and must train a policy from it, with no further environment interaction.

Transformers

Already familiar

Method

We use a transformer to model trajectories. One issue is how to represent rewards: ideally an agent should aim to take actions that maximize future rewards rather than just the present reward, so future rewards are more informative about an agent's behavior. The return-to-go is defined as

$$ \hat{R}_t = \sum_{t'=t}^{T} r_{t'} $$

for timestep $t$. Trajectories are then represented as sequences of (return-to-go, state, action) triples, with the conditioning tokens placed before the action they condition:

$$(\hat{R}_0, s_0, a_0), (\hat{R}_1, s_1, a_1), \ldots , (\hat{R}_T, s_T, a_T)$$
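The returns-to-go are just a backward cumulative sum over the reward sequence; a minimal sketch (the helper name is mine, not the paper's):

```python
def returns_to_go(rewards):
    """Compute R_hat_t = sum of rewards from timestep t to T via one backward pass."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]  # accumulate rewards from the end of the episode
        rtg[t] = running
    return rtg

print(returns_to_go([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]
```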

This transformer is then trained to predict the next action from the preceding tokens, using cross-entropy loss for discrete action spaces (or MSE for continuous ones), on the given set of trajectories.
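For intuition, here is a toy sketch of how one trajectory might be flattened into the interleaved token stream the transformer consumes; the tagged-tuple scheme is purely illustrative, not the paper's actual embedding code:

```python
def flatten_trajectory(rtgs, states, actions):
    """Interleave per-timestep (return-to-go, state, action) triples into one sequence."""
    seq = []
    for r, s, a in zip(rtgs, states, actions):
        seq.append(("rtg", r))     # conditioning token: remaining return
        seq.append(("state", s))   # conditioning token: current state
        seq.append(("action", a))  # prediction target at train time
    return seq

seq = flatten_trajectory([3.0, 2.0], ["s0", "s1"], [0, 1])
# At train time the model predicts each "action" token from all tokens before it.
```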

To extract some version of an improved policy, we condition the model at evaluation time: set the initial return-to-go to a target total reward $R$, and then decrement it by each observed reward as the episode unfolds. Setting a high target lets us improve beyond imitating any single policy in the dataset, so the method can learn without an expert. The biggest weakness of this approach is that it cannot find good behaviors that are not present in the sample trajectories.
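The evaluation-time loop can be sketched as follows. The `ToyEnv` and `ToyModel` here are stand-ins I made up so the sketch runs end to end; the real method would use a trained Decision Transformer and an actual environment:

```python
class ToyEnv:
    """Stand-in environment: fixed-length episode, reward 1.0 per step."""
    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, self.t >= self.horizon  # state, reward, done

class ToyModel:
    """Stand-in policy: always returns action 0 (a real model would condition on history)."""
    def predict(self, states, actions, rtgs):
        return 0

def rollout(model, env, target_return, max_steps=100):
    """Return-conditioned rollout: decrement the target return by each observed reward."""
    state = env.reset()
    rtg = target_return
    states, actions, rtgs = [state], [], [rtg]
    total = 0.0
    for _ in range(max_steps):
        action = model.predict(states, actions, rtgs)
        state, reward, done = env.step(action)
        rtg -= reward       # remaining return the model is asked to achieve
        total += reward
        states.append(state)
        actions.append(action)
        rtgs.append(rtg)
        if done:
            break
    return total, rtgs

total, rtgs = rollout(ToyModel(), ToyEnv(horizon=5), target_return=5.0)
# If the target is achieved exactly, the final return-to-go hits 0.
```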

Experiments

They ran a bunch of offline RL (using this return-conditioned policy) and imitation learning evaluations; I can read the details if I need to implement something like that.

tl;dr, the Decision Transformer can:

  • Learn without expert trajectories
  • Achieve strong offline performance
  • Handle long-horizon credit assignment well

Takeaways

  • The paper didn’t actually seem super insightful; it mostly reads like they were the first to pick obviously low-hanging fruit.
  • Here the decision transformer is primarily used for offline RL. My main curiosity after this paper is how transformers have made their way into more complicated online RL training algorithms and model-based agents. Being able to come up with accurate ways to model trajectories of a policy seems broadly very useful, and I feel like we didn’t see its full power here since the offline setting caps its performance.