Discussing the article: "Neural networks made easy (Part 57): Stochastic Marginal Actor-Critic (SMAC)"

 

Check out the new article: Neural networks made easy (Part 57): Stochastic Marginal Actor-Critic (SMAC).

Here I will consider the fairly new Stochastic Marginal Actor-Critic (SMAC) algorithm, which allows building latent variable policies within the framework of entropy maximization.

When building an automated trading system, we develop algorithms for sequential decision making. Reinforcement learning methods are aimed exactly at solving such problems. One of the key issues in reinforcement learning is the exploration process as the Agent learns to interact with its environment. In this context, the principle of maximum entropy is often used, which motivates the Agent to perform actions with the greatest degree of randomness. However, in practice, such algorithms train simple Agents that learn only local changes around a single action. This is due to the need to calculate the entropy of the Agent's policy and use it as part of the training goal.
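For orientation, the entropy-regularized training goal mentioned here is usually written as the expected return plus the policy entropy weighted by a temperature coefficient α; the notation below is the standard one and is not quoted from the article:

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t}\gamma^{t}\Big(r(s_t,a_t) + \alpha\,\mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big)\right]$$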

At the same time, a relatively simple approach to increasing the expressiveness of an Actor's policy is to use latent variables, which provide the Agent with its own inference procedure to model stochasticity in observations, the environment and unknown rewards.


Introducing latent variables into the Agent's policy allows it to cover more diverse scenarios that are compatible with historical observations. It should be noted here that policies with latent variables do not admit a simple closed-form expression for their entropy. Naive entropy estimation can lead to catastrophic failures in policy optimization. In addition, high-variance stochastic updates for entropy maximization do not readily distinguish between local random effects and multimodal exploration.
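To make the difficulty concrete: with a latent variable z the Actor's action distribution becomes a marginal (mixture) over z, and the entropy of such a mixture generally has no closed form, so it has to be estimated or bounded rather than computed exactly. In standard notation (an illustration, not a formula taken from the article):

$$\pi(a \mid s) = \int \pi(a \mid s, z)\,p(z \mid s)\,dz, \qquad \mathcal{H}\big(\pi(\cdot \mid s)\big) = -\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\log \pi(a \mid s)\big]$$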

One way to address these shortcomings of latent-variable policies was proposed in the article "Latent State Marginalization as a Low-cost Approach for Improving Exploration". The authors propose a simple yet effective policy optimization algorithm capable of providing more efficient and robust exploration in both fully observable and partially observable environments.

Author: Dmitriy Gizlyk

 
Can someone help me understand how to use the code in the article for testing and demo trading? I appreciate everyone's help!
 

Every pass of the Test EA generates drastically different results, as if the model were different from all previous ones. It is obvious that the model evolves with every single pass of Test, but the behaviour of this EA is hardly an evolution, so what stands behind it?

Here are some pictures:

graph1

graph2

graph3

 

Buy and sell transactions seem to be insufficiently controlled in the Test and possibly Research scripts. Here are some messages:

2024.04.27 13:40:29.423 Core 01 2024.04.22 18:30:00   current account state: Balance: 9892.14, Credit: 0.00, Commission: 0.00, Accumulated: 0.00, Assets: 0.00, Liabilities: 0.00, Equity 9892.14, Margin: 0.00, FreeMargin: 9892.14

2024.04.27 13:40:29.423 Core 01 2024.04.22 18:30:00   calculated account state: Assets: 0.00, Liabilities: 0.00, Equity 9892.14, Margin: 11359.47, FreeMargin: -1467.33
2024.04.27 13:40:29.423 Core 01 2024.04.22 18:30:00   not enough money [market buy 0.96 EURUSD.pro sl: 1.06306 tp: 1.08465]

2024.04.27 13:40:29.423 Core 01 2024.04.22 18:30:00   failed market buy 0.96 EURUSD.pro sl: 1.06306 tp: 1.08465 [No money]

Unless margin overruns are intended, simple limits on buy_lot after line 275 and on sell_lot after line 296 would eliminate this behaviour of the Test script.
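For illustration only, here is a minimal sketch of such a limit, assuming the Test script holds the requested volume in a buy_lot / sell_lot variable before sending the order; the helper name LimitLotByMargin and its placement are my own and not part of the article's code:

//--- cap a requested lot so that its margin requirement fits into the free margin
double LimitLotByMargin(ENUM_ORDER_TYPE type, double lot)
  {
   double price  = SymbolInfoDouble(_Symbol, type == ORDER_TYPE_BUY ? SYMBOL_ASK : SYMBOL_BID);
   double step   = SymbolInfoDouble(_Symbol, SYMBOL_VOLUME_STEP);
   double minlot = SymbolInfoDouble(_Symbol, SYMBOL_VOLUME_MIN);
   double free   = AccountInfoDouble(ACCOUNT_MARGIN_FREE);
   double margin = 0;
//--- margin required for the requested volume
   if(!OrderCalcMargin(type, _Symbol, lot, price, margin) || margin <= 0)
      return lot;
   if(margin <= free)
      return lot;
//--- scale the volume down proportionally and snap it to the volume step
   double scaled = MathFloor(lot * free / margin / step) * step;
   return (scaled < minlot ? 0.0 : scaled);
  }

With something like buy_lot = LimitLotByMargin(ORDER_TYPE_BUY, buy_lot) after line 275 (and the ORDER_TYPE_SELL counterpart after line 296), the "not enough money" rejections shown above should disappear, at the cost of trading a smaller volume when free margin is short.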

 
Chris #:

Every pass of the Test EA generates drastically different results, as if the model were different from all previous ones. It is obvious that the model evolves with every single pass of Test, but the behaviour of this EA is hardly an evolution, so what stands behind it?

Here are some pictures:


This model uses a stochastic Actor policy, so at the beginning of training we see random deals on every pass. We collect these passes and restart the training of the model, and we repeat this process several times until the Actor finds a good policy of actions.

 

Let's put the question another way. Having collected samples (Research) and processed them (Study), we run the Test script. In several consecutive runs, without any Research or Study in between, the results obtained are completely different.

The Test script loads a trained model in the OnInit function (line 99). Here we feed the EA a model which should not change during Test processing. It should be stable, as far as I understand, so the final results should not change either.

In the meantime, we do not conduct any model training; the Test only collects more samples.

Randomness is rather observed in the Research module and possibly in the Study while optimizing a policy.

The Actor is invoked in line 240 to calculate the feedforward results. If it isn't randomly initialized at the moment of creation, and I believe that is the case, it should not behave randomly.

Do you find any misconception in the reasoning above? 

 
Chris #:

Let's put the question another way. Having collected samples (Research) and processed them (Study), we run the Test script. In several consecutive runs, without any Research or Study in between, the results obtained are completely different.

The Test script loads a trained model in the OnInit function (line 99). Here we feed the EA a model which should not change during Test processing. It should be stable, as far as I understand, so the final results should not change either.

In the meantime, we do not conduct any model training; the Test only collects more samples.

Randomness is rather observed in the Research module and possibly in the Study while optimizing a policy.

The Actor is invoked in line 240 to calculate the feedforward results. If it isn't randomly initialized at the moment of creation, and I believe that is the case, it should not behave randomly.

Do you find any misconception in the reasoning above?

The Actor uses a stochastic policy. We implement it with a VAE layer.

//--- layer 10: fully connected layer whose 2 * NActions outputs parameterize the sampling layer below
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronBaseOCL;
   descr.count = 2 * NActions;
   descr.activation = SIGMOID;
   descr.optimization = ADAM;
   if(!actor.Add(descr))
     {
      delete descr;
      return false;
     }
//--- layer 11: VAE layer that samples NActions values from the distribution parameterized by layer 10
   if(!(descr = new CLayerDescription()))
      return false;
   descr.type = defNeuronVAEOCL;
   descr.count = NActions;
   descr.optimization = ADAM;
   if(!actor.Add(descr))
     {
      delete descr;
      return false;
     }

The CNeuronVAEOCL layer uses the output of the previous layer as the mean and STD of a Gaussian distribution and samples the action from this distribution. At the start we initialize the model with random weights, so it generates random means and STDs, and as a result we get random actions on every pass of the model test. During training the model will find suitable means for every state, and the STD tends to zero.
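As a rough sketch of what that sampling step does (my own simplified illustration, not the article's OpenCL implementation; the helper names and the assumption that the first NActions inputs are means and the next NActions are standard deviations are mine):

//--- draw one value from N(mean, std) via the Box-Muller transform
double SampleGauss(double mean, double std)
  {
   double u1 = (MathRand() + 1.0) / 32768.0;
   double u2 = (MathRand() + 1.0) / 32768.0;
   double z  = MathSqrt(-2.0 * MathLog(u1)) * MathCos(2.0 * M_PI * u2);
   return mean + std * z;
  }
//--- treat the first half of the inputs as means, the second half as STDs
void SampleActions(const double &inputs[], double &actions[], int n_actions)
  {
   ArrayResize(actions, n_actions);
   for(int i = 0; i < n_actions; i++)
      actions[i] = SampleGauss(inputs[i], inputs[n_actions + i]);
  }

Because fresh noise is drawn on every forward pass, the same trained weights can still produce different actions in consecutive Test runs; the behaviour only becomes nearly deterministic once the learned STDs collapse towards zero.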

Neural networks made easy (Part 21): Variational autoencoders (VAE) (www.mql5.com)
 
Rather, the Test script provides insight into the capabilities of the rest of the algorithm. Since there is still a degree of freedom in the form of variable, unrecorded initial weight values at the stage of creating the VAE, it is not possible to pin down and recreate the full, optimal model. Was that supposed to be the purpose of this script?