Discussing the article: "MQL5 Wizard Techniques you should know (Part 49): Reinforcement Learning with Proximal Policy Optimization"


Check out the new article: MQL5 Wizard Techniques you should know (Part 49): Reinforcement Learning with Proximal Policy Optimization.

Proximal Policy Optimization is another reinforcement-learning algorithm that updates the policy, often represented as a network, in very small incremental steps to ensure model stability. As with previous articles, we examine how this could be of use in a wizard-assembled Expert Advisor.

We continue our series on the MQL5 wizard, where lately we have been alternating between simple patterns from common indicators and reinforcement-learning algorithms. Having considered indicator patterns (Bill Williams’ Alligator) in the last article, we now return to reinforcement learning, and this time the algorithm we are looking at is Proximal Policy Optimization (PPO). It is reported that this algorithm, which was first published in 2017, is the reinforcement-learning algorithm of choice behind ChatGPT, so clearly there is some hype surrounding this approach. PPO aims to optimize the policy (the function defining the actor’s actions) in a way that improves overall performance while preventing drastic changes that could make the learning process unstable.

It does not do this in isolation, but works in tandem with ideas from other reinforcement-learning algorithms, some of which we have looked at in this series, and which broadly fall into two categories: policy-based and value-based methods. To recap, the algorithms we covered earlier, Q-Learning, SARSA, and temporal difference learning, are value-based methods, whereas PPO itself is a policy-based (policy-gradient) approach. So, what is PPO all about, then?

As alluded to above, the ‘problem’ PPO solves is preventing the policy from changing too much during updates. The thesis behind this is that without any intervention to manage update frequency and magnitude, the agent might forget what it has learned, make erratic decisions, or perform worse in the environment. PPO therefore ensures that updates are small but meaningful. PPO starts from a predefined policy with its parameters, where a policy is simply a function that maps environment states to the actor’s actions and is adjusted in response to rewards.
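To make the idea of "small but meaningful" updates concrete, below is a minimal, illustrative MQL5-style sketch of PPO's clipped surrogate objective. It is not code from the article's attached files; the function name ClippedSurrogate, its parameters, and the default clip range epsilon are hypothetical and only serve to show how the probability ratio between the new and old policy is clamped.

//+------------------------------------------------------------------+
//| Illustrative sketch of PPO's clipped surrogate objective.        |
//| ratio     : pi_new(a|s) / pi_old(a|s), the probability ratio     |
//| advantage : estimated advantage of the action that was taken     |
//| epsilon   : clip range, commonly around 0.1 to 0.2 (assumed)     |
//+------------------------------------------------------------------+
double ClippedSurrogate(const double ratio, const double advantage, const double epsilon = 0.2)
  {
//--- clamp the ratio so a single update cannot move the policy too far
   double clipped = MathMax(1.0 - epsilon, MathMin(1.0 + epsilon, ratio));
//--- take the pessimistic (smaller) of the unclipped and clipped terms
   return(MathMin(ratio * advantage, clipped * advantage));
  }

A training step would then maximize the average of this term over a batch of sampled transitions (or minimize its negative), and it is this clamping of the ratio that keeps each policy update "proximal" to the previous policy.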

Author: Stephen Njuki