Taking Neural Networks to the next level - page 3

 
Icham Aidibe:

I've pinned the thread to follow the progression of this project.

@Marco vd Heijden, neural networks & patterns - it seems to me you are also working on such a project, aren't you?


That is correct, @Icham Aidibe.

I have done some work along these lines in the past.

But as you know, I was writing about a different approach, similar to pattern recognition.

As in facial recognition, recognizing a cat in a picture, decoding human handwriting, or voice-to-text: it's basically all the same networks, just adapted to a different target application.

For me the most important part is that it has to be able to learn by itself, feeding only off win/lose feedback and an OHLC data feed.

 

Yup, then I did well to call you over here.

@Chris70: Marco is your man!

 

Time for an update... in the meantime I have:

1. added the "ADAM" Optimizer algorithm as another way to make the learning process more efficient (the Python guys will know what I'm talking about); it's not really necessary, but also gives me more options for future work with neural networks

2. implemented apoptosis and pruning functions; now what the hell is this guy talking about? Well... "apoptosis" is the medical/biological term for "programmed cell death", which is a natural occurrence in the aging body. It also kills some tumor cells before they develop into actual cancer (which other cells that escape such mechanisms might then do). The death of cells and their replacement by fresh cells is also an important part of the repair mechanisms that keep any tissue healthy, and it plays its role during brain development in our childhood --> and here is the connection to neural networks:

When we decide on the architecture of a network, the decision of how many layers and how many neurons per layer we actually need isn't obvious at all. Apart from that, an oversized network is also more likely to suffer from overfitting. So "apoptosis" basically means "killing" redundant neurons after performing a scheduled ranking of the significance of each neuron. There are different methods for how the ranking can be done, e.g. by looking at the total sum of all (absolute) incoming and outgoing weights to/from an individual neuron. The general concept of apoptosis goes back to the early 90s and it never was a big thing, but it might actually become one in the near future when we need to make neural networks more efficient for mobile applications. The reason why it hasn't been very successful so far is probably that reducing the number of neurons comes at a price, i.e. a decline in accuracy. There is a compromise to be made between making the network smaller (and faster and less memory-demanding) on the one hand and accuracy on the other. Up to a point the impact isn't huge, though.

I ran a little test with my autoencoder by making it completely oversized at first: with 13 layers, I chose 720 neurons for EACH layer (= also all hidden layers) except for the bottleneck layer, which adds up to 8676 neurons in total (!!!)... now this is obvious overkill - on purpose, for the experiment. Then I made the algorithm "kill" 50% of the neurons (scheduled over the first 10000 backpropagation iterations) and compared the results to another test run where I kept all neurons alive: with the 50% "apoptosis" the mean squared error (MSE) was 2.5% higher. Not much of a difference...

"pruning" is also a concept that is stolen directly from neuro-anatomy, but this time it doesn't refer to the neuron cells, but to their connections, namely those "axon" antennas. For the brain, especially during early development it's "use it or lose it". Connections that are not used can be retracted, which is part of the plasticity of the brain. Back to computers and neural networks: why not make a neural network more efficient by deleting some irrelevant weights that have very low absolute values or rarely contribute to neuron activations ("deleting"=assigning "0" values to them in the forward pass and ignoring them in backpropagation, which is the more performance demanding part)? That's exactly what "pruning" is about: we give our network a little haircut by making it remove some weights and therefore making it more sparse. The effects are similar to "apoptosis": you do a little and get more performance, you do too much and you lose accuracy. So both methods have their issues, but I think it's nice to know about the concepts and if any of you is thinking about developing your own networks in MQL, it's just one more thing worth a consideration.

3. [and then I also did some bug fixing and changed the file structure, i.e. how I save all those data]
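To make the two concepts a bit more tangible, here is a rough sketch of how the ranking/"killing" and the weight pruning could look in MQL5. This is an illustration only - the array names and the fixed layer sizes are made up for the example, it's not a copy of my actual classes:

```
#define PREV 720   // neurons in the previous layer (example sizes only)
#define CURR 720   // neurons in the layer we are thinning out
#define NEXT 720   // neurons in the next layer

double w_in[CURR][PREV];    // weights feeding into neuron j of this layer
double w_out[NEXT][CURR];   // weights leaving neuron j of this layer
bool   alive[CURR];         // "apoptosis" mask

// significance of a neuron = total sum of absolute incoming and outgoing weights
double NeuronScore(const int j)
  {
   double s=0.0;
   for(int i=0; i<PREV; i++) s+=MathAbs(w_in[j][i]);
   for(int k=0; k<NEXT; k++) s+=MathAbs(w_out[k][j]);
   return(s);
  }

// "apoptosis": kill the lowest ranked fraction of neurons (e.g. 0.5 for the 50% test above)
void Apoptosis(const double kill_fraction)
  {
   double score[CURR],sorted[CURR];
   for(int j=0; j<CURR; j++) score[j]=NeuronScore(j);
   ArrayCopy(sorted,score);
   ArraySort(sorted);                                  // ascending ranking
   double cutoff=sorted[(int)(kill_fraction*CURR)];
   for(int j=0; j<CURR; j++)
     {
      alive[j]=(score[j]>cutoff);
      if(alive[j]) continue;
      for(int i=0; i<PREV; i++) w_in[j][i]=0.0;        // disconnect the dead neuron
      for(int k=0; k<NEXT; k++) w_out[k][j]=0.0;
     }
  }

// "pruning": give the weight matrix a haircut by zeroing tiny weights
void PruneWeights(const double threshold)
  {
   for(int j=0; j<CURR; j++)
      for(int i=0; i<PREV; i++)
         if(MathAbs(w_in[j][i])<threshold) w_in[j][i]=0.0;
  }
```

In a real implementation the zeroed connections would of course also be skipped during backpropagation - that's where the actual performance gain comes from.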

Here is an example (EURUSD 1-minute chart) of the way the trained autoencoder (here after ~19,000 iterations) now "sees" a chart. For this example I used real tick data and 720 inputs at >=5-second increments, adding up to 1 hour of data per input, encoded to 36 numbers in the "bottleneck" neurons, so the encoder is reducing the data by 95%. I think with this example you can see what I meant by "denoising" in an earlier post: the rebuilt price pattern follows the main(!) moves, but not every little spike. Compared to e.g. just using a moving average, you can see that regression encoding has the benefit of having absolutely no lag. The resolution is not super detailed, but that's exactly what we want - a simple representation of what the price is actually doing:

(attached chart image: autoencoder2)

 

You seem to have a profound grasp of deep nets. I guess you have a biostatistics background. I have some questions:

--- Would the information bottleneck principle further improve your existing deep net algorithm? https://arxiv.org/abs/1503.02406

--- You said you added the "ADAM" optimizer algorithm. Will you share the full code in the forum? Thanks.

 
It would be interesting to have a comparison between standard NN results & your tweaked one. 
 
nevar:

You seem to have a profound grasp of deep nets. I guess you have a biostatistics background. I have some questions:

--- Would the information bottleneck principle further improve your existing deep net algorithm? https://arxiv.org/abs/1503.02406

--- You said you added the "ADAM" optimizer algorithm. Will you share the full code in the forum? Thanks.

My professional background is in a completely different field, so everything I know is self-taught and there are for sure many things I still need to learn. But just as I learnt a lot by reading articles that are publicly available on the internet, I hope to help some people out there who are on a similar journey to mine. So I believe (/hope) that the content of this series is of some value for anybody working with neural networks, even if I don't plan to give away the entire code. Please understand that a lot of work has been put into it. I think the potential benefit of this series is more to help with the underlying theory, give some ideas and point out some problems that one might stumble upon when developing neural networks. Apart from that, I'm doing this as an experiment, i.e. investigating the practical usefulness of LSTM networks with autoencoders for price forecasting. 

The principle behind the "ADAM" algorithm has been explained in many good articles and the formulas are publicly available, too. Here is a good example about optimizers, including Adam:

http://ruder.io/optimizing-gradient-descent/
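Just to show how little code the update rule itself needs: here is a sketch of the standard Adam formulas for a single weight (the function name and the per-weight state variables m and v are just for illustration; this is not copied from my own class):

```
// one Adam step for a single weight; m and v are this weight's moment
// estimates and must be kept between iterations (t = iteration counter, starting at 1)
void AdamStep(double &w,double &m,double &v,const double grad,const int t,
              const double lr=0.001,const double beta1=0.9,
              const double beta2=0.999,const double eps=1e-8)
  {
   m=beta1*m+(1.0-beta1)*grad;                // 1st moment: running mean of the gradient
   v=beta2*v+(1.0-beta2)*grad*grad;           // 2nd moment: running mean of the squared gradient
   double m_hat=m/(1.0-MathPow(beta1,t));     // bias correction for the early iterations
   double v_hat=v/(1.0-MathPow(beta2,t));
   w-=lr*m_hat/(MathSqrt(v_hat)+eps);         // adaptive, per-weight step size
  }
```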

But please note that those optimizers are a nice gimmick that speeds up the learning process a little; they are by no means necessary. You'll do just fine with simple vanilla gradient descent!! There are only two things I highly recommend in this case:

(1) working with time decay for the learning rate in order to allow for fine-tuned learning at the later training stages; this can be done by simply multiplying the learning rate by a factor of "factor/iterations"

(2) adding some kind of momentum component in order to reduce the risk of getting stuck in local optima of the loss function, which would prevent the overall error from declining further; the simplest way to do this is to perform the weight corrections using a running moving average of the calculated corrections (instead of the individual values). A tiny sketch of both recommendations follows below.
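A minimal sketch of both points combined (illustrative names; I'm using a common variant of the "factor/iterations" decay here so that the learning rate starts at its initial value and then shrinks):

```
// plain gradient descent with (1) a time-decayed learning rate and
// (2) a momentum term = running moving average of the corrections
void SgdMomentumStep(double &w,double &velocity,const double grad,const int iteration,
                     const double lr0=0.1,       // initial learning rate
                     const double decay=1000.0,  // the "factor" from point (1)
                     const double momentum=0.9)
  {
   double lr=lr0*decay/(decay+iteration);          // learning rate shrinks with the iteration count
   velocity=momentum*velocity+(1.0-momentum)*grad; // moving average instead of the individual gradient
   w-=lr*velocity;
  }
```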

The bottleneck principle is the essential component of any neural network autoencoder, so of course I make use of it in my algorithm; in fact, the whole autoencoder principle would be useless without the bottleneck part. I explained this in post N°4 of this thread.

 

Update with good news: now that the autoencoder part is complete and I therefore have some usable data, I was also able to couple the autoencoder with the LSTM network, and all bugs seem to be fixed for now.

I did a few short test runs that confirmed that the LSTM network is performing the learning process correctly: the loss function decreases, and the output of some predicted chart patterns (= after re-encoding) visually makes a good first impression.

So now it's time for the training part of the LSTM network and for trying to make some price predictions that are hopefully useful enough to be put into some actual trading.


This series has made some progress, so for those who didn't read it all (and maybe as a quick recap), here is once again, in a few words, what the program is doing:

1. I trained a neural network (an autoencoder-type MLP) to encode chart patterns of 1-hour intervals into simpler, denoised representations

2. this encoding process goes along with a 95% reduction of the amount of data (720 prices at 5-second increments are now represented by a series of only 36 numbers)

3. every 36-number sequence represents a single timestep input (= one new input every hour) for another neural network: a "recurrent" neural network with a special type of memory cell neuron, the so-called LSTM (long short-term memory)

4. these LSTM networks are particularly suitable for time series analysis - in this case: the attempt at forex price prediction

5. with its memory capability and a lookback period of a given number of timesteps, I will try to make a prediction 1 hour ahead

6. the "output" = the predictions of the LSTM network will also be in the form of a 36-number sequence --> those are not understable at first glance and therefore need to be DEcoded first by once again using the trained autoencoder, now "in reverse" (more precisely: feeding the 36-number sequence into the "bottleneck" part of the autoencoder and calculation from there to the output layer (instead of going from the inputs to the bottleneck, like one would do for ENcoding)

7. the LSTM network will be trained to consider the future prices as the correct output and will therefore try to do the same in live trading

8. this is possible because during training historical data are used and the "future" is then already known, so the network learns how it must behave to make the best predictions that the past data allow for

9. it is very well possible that there isn't much to be predicted at all, because the past might not indicate anything at all about future price moves, and maybe the "random walk" theory and the "efficient market hypothesis" are 100% correct - this is something that this experiment will have a look at

10. IF there is something useful to predict, I will then translate this information into trading conditions
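To make point 6 a bit more concrete, here is a little sketch of the decoding step, reduced to a single hidden decoder layer for readability (the real autoencoder has more layers; all names and the hidden layer size are made up, only the 36 and 720 figures match the description above):

```
#define CODE_SIZE  36     // bottleneck size
#define DEC_HIDDEN 180    // hypothetical decoder hidden width
#define OUT_SIZE   720    // rebuilt price series

double w_h[DEC_HIDDEN][CODE_SIZE];   // weights bottleneck -> hidden
double b_h[DEC_HIDDEN];
double w_o[OUT_SIZE][DEC_HIDDEN];    // weights hidden -> output
double b_o[OUT_SIZE];

// decoding = forward pass that STARTS at the bottleneck instead of the input layer
void Decode(const double &code[],double &rebuilt[])
  {
   double hidden[DEC_HIDDEN];
   for(int j=0; j<DEC_HIDDEN; j++)
     {
      double z=b_h[j];
      for(int i=0; i<CODE_SIZE; i++) z+=w_h[j][i]*code[i];
      hidden[j]=MathTanh(z);                        // hidden activation
     }
   ArrayResize(rebuilt,OUT_SIZE);
   for(int k=0; k<OUT_SIZE; k++)
     {
      double z=b_o[k];
      for(int j=0; j<DEC_HIDDEN; j++) z+=w_o[k][j]*hidden[j];
      rebuilt[k]=z;                                 // linear output for the regression targets
     }
  }
```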


If you google for pictures of LSTM networks, you will often see representations where there is just one neuron cell shown per timestep. This is confusing at first and often only serves the purpose of simplicity (although those very basic time series analysis networks with just one neuron and one input per timestep certainly do exist, and they may work just fine, depending on the task at hand). When dealing with neural networks, most of the variables actually are array elements, so almost everything is represented by vectors/matrices and not by single numbers (= scalars). This is why, in graphical representations of neural network architectures, a single neuron often in reality stands for a whole layer of neurons, and it doesn't have just one input but a vector of inputs, each associated with a weight.

I'm mentioning all this because the LSTM network that this thread is about is also 3-dimensional: (1) several layers with (2) several neurons each are stacked upon each other, and all of those neurons are connected via (3) the time dimension to earlier versions of themselves.
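As a tiny illustration (hypothetical names, sizes just as an example): what looks like "one neuron per timestep" in the usual pictures is really a whole vector per layer and per timestep, so the state of a stacked LSTM quickly becomes a 3-dimensional structure:

```
#define LSTM_NEURONS 36
#define TIMESTEPS    48

// state of ONE LSTM layer at ONE timestep
struct SLstmState
  {
   double h[LSTM_NEURONS];   // hidden state vector ("the neuron" in the diagrams)
   double c[LSTM_NEURONS];   // cell state vector
  };

SLstmState layer_state[TIMESTEPS];   // one such state per timestep, for a single layer
// a stacked network keeps an array like this for every LSTM layer
// --> layers x neurons x timesteps = the 3 dimensions mentioned above
```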

All the calculations, both in the learning process and during forward testing, go through this entire 3-dimensional grid. It probably isn't surprising that this comes with some challenges. I'm always astonished how the algorithm calculates its way through several thousand neurons per second. Yes: with MQL5!! Say what you want about MQL, but nobody shall say that it can't be fast if we let it...

The challenge comes more with the absolute numbers, i.e. seeing some extremely high or low values that can no longer be represented as "double" precision floating point numbers. LSTM-type recurrent neural networks (RNNs) suffer much less from the "vanishing gradient" problem than simple RNNs, but this doesn't mean that there are no limitations. Whether exploding or vanishing numbers are produced depends on many things: the scaling of inputs and labels (= targets = "real" values), the method of weight and bias initialization, the type of activation functions, the dimensions of the network and the size in each dimension.
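Two simple guards that can help here are a NaN/Inf sanity check and gradient clipping; the clipping part is a common general remedy rather than something specific to my code, so take this as a sketch only:

```
// returns false as soon as any element is NaN or +/-Inf
bool IsHealthy(const double &values[])
  {
   for(int i=0; i<ArraySize(values); i++)
      if(!MathIsValidNumber(values[i]))
         return(false);
   return(true);
  }

// rescale a gradient vector whose euclidean norm "explodes"
void ClipGradients(double &grad[],const double max_norm=5.0)
  {
   double norm=0.0;
   for(int i=0; i<ArraySize(grad); i++) norm+=grad[i]*grad[i];
   norm=MathSqrt(norm);
   if(norm<=max_norm) return;
   double scale=max_norm/norm;
   for(int i=0; i<ArraySize(grad); i++) grad[i]*=scale;
  }
```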

In my early test runs (all with 36 neurons in all LSTM layers) I was able to get the training going with, for example:

- 500 timesteps and just 1 LSTM layer

- 48 timesteps and 3 LSTM layers

- 24 timesteps and 5 LSTM layers

Every time I went significantly beyond these orders of magnitude, I saw NaN/Inf numbers and no quick decrease of the loss function.

I need to do some more fine-tuning before I decide which dimensions I will go with, but it definitely seems like something useful to work with.

 

Before the actual test, as in any "scientific" experiment, the hypothesis ("significant predictions are possible") versus the null hypothesis ("it's just a random walk"), as well as the methodology, of course need to be clearly defined.

But once I get the results, it should also be clear how to interpret them. And these decisions should be made in advance, i.e. not by just collecting the results and deciding whether I like what I see.

I attached a file as an example of what my network test result reports currently look like (forget about the actual numbers there, this is really just an example).

I certainly want to look at

- r squared (coefficient of determination)

- mean absolute error

- max. absolute deviation

But especially with "r squared", the problem is that there is no cut-off that defines "good enough". It clearly shouldn't be zero, but I will need something else in order to decide whether my predictions are statistically significant.
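For reference, the three measures from the list above are easy enough to compute directly in MQL; a small sketch (illustrative function name, with predictions and labels passed as plain arrays):

```
// r squared, mean absolute error and maximum absolute deviation
void EvaluateForecast(const double &pred[],const double &label[],
                      double &r_squared,double &mae,double &max_ad)
  {
   int n=MathMin(ArraySize(pred),ArraySize(label));
   r_squared=0.0; mae=0.0; max_ad=0.0;
   if(n<1) return;
   double mean=0.0;
   for(int i=0; i<n; i++) mean+=label[i];
   mean/=n;
   double ss_res=0.0,ss_tot=0.0;
   for(int i=0; i<n; i++)
     {
      double err=label[i]-pred[i];
      ss_res+=err*err;
      ss_tot+=(label[i]-mean)*(label[i]-mean);
      mae+=MathAbs(err);
      max_ad=MathMax(max_ad,MathAbs(err));
     }
   mae/=n;
   if(ss_tot>0.0) r_squared=1.0-ss_res/ss_tot;   // coefficient of determination
  }
```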

Question for the statistics guys out there: does anybody have a good idea which measure I could use?

I should mention that

(1.) both predictions and labels of course won't be 'normally' distributed due to the "fat tails" problem with financial data and

(2.) I'm not dealing with a single prediction, but with a whole series of price forecasts for the next hour. Of course I can (and will) extract the predicted open/high/low/close from them, but it would be nice to have some kind of statistical tool to decide whether the whole price sequence might also be just random, or whether I have actually predicted something useful

(3.) it shouldn't be super complicated to implement in MQL code, because I would very much prefer to work directly with the numbers that the program returns, rather than depend on R / SPSS / MATLAB ... 

Any thoughts / ideas very much appreciated !

Files:
testreport.txt  625 kb
 

Go on, man! You're a pioneer - buy/sell & show us something concrete expressed in USD.

 

Focus on point 3. Then you have other means at your disposal, like writing an indicator, which does not require any statistical measure. A statistical measure alone - just numbers - can be misleading; plotting a chart would reveal everything very quickly.

Here is a picture of a simple sine wave forecast, just to make sure the code I wrote works.

I work with no third-party software, just MQL. I use the "online training" method only. The picture shows the start (zero knowledge) and the progress. 


Here is a picture from another "indicator" trying to forecast EURUSD. It can be noted that it shows the naive forecast. 


My point being: a picture speaks more than a thousand measurements.


Regarding testing for a random walk: you can use your LSTM itself - if it reverts to the naive forecast and you know your LSTM is capable of predicting, the culprit will be your data, in which no structure can be found. There are also other ways; read this: https://people.duke.edu/~rnau/411rand.htm. Or just google "random walk" or "random walk forecast".
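One simple yardstick along those lines (a rough sketch, basically a Theil's-U-style ratio with absolute instead of squared errors; names are made up): compare the model's error to the error of the naive "next value = current value" forecast. A ratio clearly below 1.0 means you beat the random-walk baseline; a ratio around 1.0 means the model has collapsed to the naive forecast.

```
// ratio of the model's absolute error to the naive (previous value) forecast's error
double NaiveRatio(const double &pred[],const double &actual[])
  {
   int n=MathMin(ArraySize(pred),ArraySize(actual));
   double model_err=0.0,naive_err=0.0;
   for(int i=1; i<n; i++)
     {
      model_err+=MathAbs(actual[i]-pred[i]);
      naive_err+=MathAbs(actual[i]-actual[i-1]);   // naive forecast = last known value
     }
   return(naive_err>0.0 ? model_err/naive_err : 0.0);
  }
```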