Discussing the article: "Neural networks made easy (Part 64): ConserWeightive Behavioral Cloning (CWBC) method"


Check out the new article: Neural networks made easy (Part 64): ConserWeightive Behavioral Cloning (CWBC) method.

As a result of tests performed in previous articles, we came to the conclusion that the performance of the trained strategy largely depends on the training set used. In this article, we will get acquainted with a fairly simple yet effective method for selecting trajectories to train models.

The authors of the method propose a new conservative regularizer for return-conditioned behavioral cloning methods that explicitly encourages the policy to stay close to the original data distribution. The idea is that the actions predicted when conditioning on large out-of-distribution returns should remain close to in-distribution actions. This is achieved by adding positive noise to the RTGs of high-return trajectories and penalizing the L2 distance between the predicted action and the ground truth. To guarantee that the resulting returns lie outside the distribution, the noise is generated so that the adjusted RTG value is no less than the highest return in the training set.
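A minimal Python sketch of this idea (function names, the noise scale, and the toy data below are illustrative assumptions, not the article's actual implementation):

```python
# Hypothetical sketch of the conservative regularization step: inflate the RTG of a
# high-return trajectory above the best return in the training set, then penalize the
# L2 distance between the actions predicted under that inflated RTG and the true actions.
import numpy as np

def perturb_rtg(rtg, max_return, rng):
    """Add positive noise so the adjusted RTG is no less than the
    highest return observed in the training set (out-of-distribution)."""
    shift = max(max_return - rtg[0], 0.0)            # lift the trajectory return up to max_return
    noise = np.abs(rng.normal(0.0, 0.1 * max(abs(max_return), 1.0)))  # extra positive noise (assumed scale)
    return rtg + shift + noise                       # same offset applied at every time step

def conservative_l2_penalty(predicted_actions, true_actions):
    """L2 distance between actions predicted under the inflated RTG
    and the ground-truth actions of the high-return trajectory."""
    return float(np.mean(np.sum((predicted_actions - true_actions) ** 2, axis=-1)))

# Toy usage with dummy data
rng = np.random.default_rng(0)
rtg = np.array([5.0, 4.0, 2.5, 1.0])      # return-to-go of one high-return trajectory
rtg_ood = perturb_rtg(rtg, max_return=7.0, rng=rng)
pred = rng.normal(size=(4, 3))            # actions the policy predicts when conditioned on rtg_ood
truth = rng.normal(size=(4, 3))           # ground-truth actions from the trajectory
loss_reg = conservative_l2_penalty(pred, truth)
```

In training, this penalty would be added to the usual behavioral cloning loss so that inflated, out-of-distribution RTGs still pull the policy toward the actions of the best observed trajectories.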

The authors propose to apply the conservative regularization only to trajectories whose returns exceed the q-th percentile of returns in the training set. This ensures that when an RTG outside the training distribution is specified, the policy behaves like the high-return trajectories rather than like a random one. The noise is added as an offset to the RTG at every time step of the trajectory.

The experiments conducted by the method authors demonstrate that using the 95th percentile generally works well in a variety of environments and data sets.
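A short illustrative sketch (helper name and toy data are assumptions) of selecting which trajectories receive the regularization by the return percentile:

```python
# Only trajectories above the q-th return percentile (q = 95 per the authors' experiments)
# get the noisy, out-of-distribution RTG and the L2 penalty during training.
import numpy as np

def select_high_return(trajectory_returns, q=95.0):
    """Boolean mask of trajectories at or above the q-th return percentile."""
    threshold = np.percentile(trajectory_returns, q)
    return trajectory_returns >= threshold

# Toy usage
returns = np.array([1.2, 3.4, 0.5, 7.8, 2.1, 9.3])   # total returns of collected trajectories
mask = select_high_return(returns, q=95.0)            # only the top trajectories are perturbed
```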

The authors of the method note that the proposed conservative regularizer differs from other conservative components used in offline RL methods, which are based on estimating the values of states and transitions. While the latter typically adjust the value function estimate to prevent extrapolation errors, the proposed method distorts the return-to-go to create out-of-distribution conditions and regularizes the predicted actions.

During the training process, I managed to obtain a model that generates profit on the historical segment of the training sample.

Test results

During the training period, the model made 141 trades, of which about 40% were closed with a profit. The maximum profitable trade is more than 4 times the maximum loss, and the average profitable trade is almost twice the average loss. Moreover, the average winning trade is 13% greater than the maximum loss. All this resulted in a profit factor of 1.11. Similar results are observed on new data.

Author: Dmitriy Gizlyk
