Discussing the article: "Neural networks made easy (Part 60): Online Decision Transformer (ODT)"

 

Check out the new article: Neural networks made easy (Part 60): Online Decision Transformer (ODT).

The last two articles were devoted to the Decision Transformer method, which models sequences of actions in the context of an autoregressive model of desired rewards. In this article, we will look at another algorithm for optimizing this method.

The Online Decision Transformer algorithm introduces key modifications to Decision Transformer to make online training effective. The first step is a generalized probabilistic training objective: instead of predicting actions deterministically, the goal is to train a stochastic policy that maximizes the likelihood of the actions observed in the trajectory.
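As a rough illustration of such a probabilistic objective, here is a minimal PyTorch-style sketch: a Gaussian action head produces a distribution over actions, and the loss is the negative log-likelihood of the actions stored in the trajectory. The names GaussianActionHead and nll_loss are illustrative, not taken from the article's code.

```python
import torch
import torch.nn as nn

class GaussianActionHead(nn.Module):
    """Stochastic action head: parameterizes a Normal distribution over actions."""
    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, hidden: torch.Tensor) -> torch.distributions.Normal:
        # Broadcast a learned, state-independent std over the predicted means.
        return torch.distributions.Normal(self.mean(hidden), self.log_std.exp())

def nll_loss(action_dist: torch.distributions.Normal,
             target_actions: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the trajectory's actions under the policy."""
    return -action_dist.log_prob(target_actions).sum(dim=-1).mean()
```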

The main property of an online RL algorithm is its ability to balance exploration and exploitation. Even with stochastic policies, the traditional DT formulation does not take exploration into account. To address this, the authors of the ODT method define exploration in terms of the entropy of the policy, which depends on the distribution of data in the trajectory. This distribution is static during offline pre-training, but dynamic during online fine-tuning, since it depends on the new data obtained while interacting with the environment.

Similar to many existing maximum entropy RL algorithms, such as Soft Actor Critic, the authors of the ODT method explicitly define a lower bound on policy entropy to encourage exploration.
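A common way to enforce such a lower bound is to treat it as a constraint and adjust a learned temperature by dual gradient descent, as SAC does. The sketch below assumes that convention and reuses the Gaussian action head from above; log_temperature (a learnable scalar, e.g. an nn.Parameter) and target_entropy (the chosen lower bound) are hypothetical names, not identifiers from the article.

```python
def entropy_constrained_losses(action_dist, target_actions,
                               log_temperature, target_entropy):
    """Policy loss with an entropy lower bound enforced via a learned temperature."""
    nll = -action_dist.log_prob(target_actions).sum(dim=-1).mean()
    entropy = action_dist.entropy().sum(dim=-1).mean()

    # Policy update: fit the trajectory's actions while keeping entropy high,
    # weighted by the current (detached) temperature.
    policy_loss = nll - log_temperature.exp().detach() * entropy

    # Temperature update: the multiplier grows when entropy falls below
    # target_entropy and shrinks when it stays above it.
    temperature_loss = log_temperature.exp() * (entropy.detach() - target_entropy)
    return policy_loss, temperature_loss
```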

Author: Dmitriy Gizlyk