Taking Neural Networks to the next level - page 17

 

Bayne, your idea of stacking neural networks doesn't add up. You think that, for example,

pricedata --> momentum-regressor        \
                                         |----> prediction
pricedata --> trend strength-regressor  /

is better than

pricedata --> prediction

But there is a logical flaw (you knew I'd say this): by choosing preliminary labels that precede the final/main label, you again reduce information. Who knows what other intermediate labels would make sense but are missing, details that would further improve the prediction? A third? A fourth?... 1000?

This is a little different if we need to answer several independent questions (parallel vs. serial). The metalabel example (although serial) is an exception: the results of the first network are essential for the trading decision, they are used further outside the network and therefore can't sit inside the 'hidden' part of the network, so they are more than just an intermediate step towards the real output. AND, most importantly, this allows us to combine a regressor and a classifier, i.e. a quantitative result with a probability rating; a single network just can't do that. But regressor stacking for only one main labeling level doesn't make much sense.
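(To make the regressor-plus-classifier combination concrete, a minimal sketch with scikit-learn models as stand-ins for the two networks; the data and labels are random placeholders, not actual code from this thread:)

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Placeholder data: features per bar and the main quantitative label
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))   # input features
y_return = rng.standard_normal(500)  # main label: future return

# Stage 1: the regressor delivers the quantitative result
primary = GradientBoostingRegressor().fit(X, y_return)
pred_return = primary.predict(X)

# Stage 2 (metalabel): the classifier rates the probability that acting on
# the primary prediction is correct; this rating is used outside the first
# model, for the trade/no-trade decision
y_meta = (np.sign(pred_return) == np.sign(y_return)).astype(int)
meta = GradientBoostingClassifier().fit(X, y_meta)
p_correct = meta.predict_proba(X)[:, 1]  # probability rating per sample
```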

I'm saying this over and over again: the more data you preserve until the final steps and the more freedom you allow, the better the network will perform. Don't over-control; have a little trust in neural networks. No limitation can increase the quality of the results. ANNs are designed to select by themselves what's relevant and what's not, because the network weights that originate from irrelevant features won't be reinforced; they just go to zero and end up in dead branches.

Just imagine for a second that a neural network had only binary activations (1 or 0). Then with only 1000 neurons the number of possible activation patterns is already a number with about 300 digits in front of the decimal point. But in reality a neuron is not binary, so the actual complexity of the processed preliminary calculations is beyond any imagination. And a typical network easily has several hundred thousand weight connections. Just imagine how many variations of the main dataflow are possible for a given input. What I'm trying to say: ANNs are so expressive that they can represent almost anything, if only we let them.
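(A quick back-of-the-envelope check in plain Python confirms the digit count:)

```python
# 1000 neurons with binary activations: 2^1000 possible activation patterns
patterns = 2 ** 1000
print(len(str(patterns)))  # 302, i.e. a number with about 300 digits
```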

If you really want to build in "subfeatures" (like your momentum example), better use a branched network architecture (i.e. one that is not fully connected) or a CNN, but don't force the network towards arbitrarily chosen features. Preserve the data and the degrees of freedom and you get more.

What you can of course do: don't stack the networks, but actually ask several questions in parallel (via separate networks or just heterogeneous outputs), get answers for trend, momentum... and whatever you need... and then make a combined trading decision, for example with a Hidden Markov Model.
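(A minimal sketch of the "heterogeneous outputs" variant, assuming Keras; the layer sizes and activations are arbitrary placeholders:)

```python
import tensorflow as tf

# One shared body answers several independent questions in parallel
inp = tf.keras.Input(shape=(64,), name="pricedata")  # e.g. 64 normalized inputs
x = tf.keras.layers.Dense(128, activation="relu")(inp)
x = tf.keras.layers.Dense(64, activation="relu")(x)

# Heterogeneous output heads: one per question
trend = tf.keras.layers.Dense(1, activation="tanh", name="trend")(x)  # -1..1
momentum = tf.keras.layers.Dense(1, name="momentum")(x)               # unbounded

model = tf.keras.Model(inputs=inp, outputs=[trend, momentum])
model.compile(optimizer="adam", loss={"trend": "mse", "momentum": "mse"})
```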

 
NELODI:
Can I turn my strategy into an EA? No, I can't. Because I can't copy my Brain into a computer.
Bayne:

*Elon Musk has left the chat*

LOL! Elon Musk is the Modern-World Chuck Norris.

There's nothing he can't do, with both hands tied behind his back.

 
Chris, I agree with your statement that a Neural Network should produce the best results if it can be fed with raw data, but how do you feed prices into a neural network, when input and output neurons can only have 2 "meaningful" states (0..1 or -1..1) and all the neurons and activation functions are binary? Are you converting prices to some binary representation, which you can feed into your Neural Network, or what am I missing here?
 
Chris70:

Bayne, your idea of stacking neural networks doesn't add up. [...] What you can of course do: don't stack the networks, but actually ask several questions in parallel (via separate networks or just heterogeneous outputs), get answers for trend, momentum... and whatever you need... and then make a combined trading decision, for example with a Hidden Markov Model.

But what if we also feed the pricedata into the "main" net? You can also see it as feeding the output of some extra neural networks as extra features to the actual neural net. Of course these extra "feature nets" need to be highly accurate and precise.

The main intention behind it is not to create a better performing neural net, but to add extra bias to the neural nets we want to train: we artificially bias them into a specific direction (whether this can outperform the "normal" approach is unknown).

This could result in a NN which performs better in specific areas (as long as the feature nets are highly accurate and precise), while we still feed the main network its prices.
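Roughly like this (a hypothetical Keras sketch; the names and shapes are placeholders, not actual code):

```python
import tensorflow as tf

# Outputs of pre-trained "feature nets" are fed to the main net as extra
# features, next to the raw prices
prices = tf.keras.Input(shape=(64,), name="pricedata")
momentum_feat = tf.keras.Input(shape=(1,), name="momentum_net_output")
trend_feat = tf.keras.Input(shape=(1,), name="trend_net_output")

merged = tf.keras.layers.Concatenate()([prices, momentum_feat, trend_feat])
x = tf.keras.layers.Dense(64, activation="relu")(merged)
prediction = tf.keras.layers.Dense(1, name="prediction")(x)

main_net = tf.keras.Model([prices, momentum_feat, trend_feat], prediction)
main_net.compile(optimizer="adam", loss="mse")
```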


What would a finance/price-related Hidden Markov Model look like in your case?

 
NELODI:
Chris, I agree with your statement that a Neural Network should produce the best results if it can be fed with raw data, but how do you feed prices into a neural network, when input and output neurons can only have 2 "meaningful" states (0..1 or -1..1) and all the neurons and activation functions are binary? Are you converting prices to some binary representation, which you can feed into your Neural Network, or what am I missing here?

Read this ftp://ftp.sas.com/pub/neural/FAQ2.html#A_std regarding treating data for NNs.

The FAQ is quite old, but it still holds valuable info, especially when you're not using some standard lib where everything happens under the hood.

I highly recommend reading the whole FAQ.

link to FAQ index
 
Enrique Dangeroux:

Read this ftp://ftp.sas.com/pub/neural/FAQ2.html#A_std regarding treating data for NNs.

link to FAQ index

Thank you.

PS. This forum messes up the FTP link. If anyone else tries to click the link and gets a "Server Not Found" error, check the address bar and make sure the address starts with "ftp://" and not just "ftp//" (the ":" is important, but the forum removes it).

 

@NELODI: Sorry for the misunderstanding; in practice usually all numbers, i.e. all weights, inputs, outputs and cell states, are doubles or floats, so there isn't just on/off or 1 and 0.

Some activation functions have a limited range (and the rarely used step function is indeed binary): for example, a sigmoid function only returns values between 0 and 1, a tanh function between -1 and +1, something like ELU between -1 and +inf... etc.

But as long as the weights can be any real number (positive or negative), the next following neurons can also receive the full spectrum of real numbers.

You can, on the other hand, limit the range of inputs/activations/outputs on purpose(!), by choice(!): for example if you have inputs that by nature have no continuous spectrum, or with "one-hot" encoded features, or with classifiers if you're only interested in a clear decision ("cat or dog..") and not in probabilities.
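(To make the "not binary" point concrete, a minimal numpy sketch of the activation functions mentioned above:)

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)                 # pre-activations: any real number

sigmoid = 1.0 / (1.0 + np.exp(-x))            # continuous output in (0, 1)
tanh = np.tanh(x)                             # continuous output in (-1, 1)
elu = np.where(x > 0.0, x, np.exp(x) - 1.0)   # continuous output in (-1, +inf)

print(sigmoid)  # a full spectrum of values, not just 0 and 1
```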

--

@Enrique Dangeroux: I can't open the link.

Not sure if this is what you mean, but this is how I'd summarize the usual data treatment (a minimal sketch of steps 3-5 follows below the list):

1. feature selection (good overview: https://machinelearningmastery.com/an-introduction-to-feature-selection/) and label definition

2. check for invalid input data / inconsistencies and define rules for how to deal with them

3. stationary transformation for time series input data

4. normalization of the input data in order to make sure they all fit distributions with a similar mean (usually 0) and standard deviation; this is not obligatory, but it can speed up the training. Visual explanation: the "landscape" in which gradient descent happens then resembles a round pit rather than the Grand Canyon, which allows for a more direct path to the global error minimum.

5. choosing a weight initialization method that matches the chosen activation function: range? uniform or normal distribution? The methods according to Kaiming He and Xavier Glorot are the most popular ones. If this step isn't done properly, there is a much higher risk that the network produces NaN and Inf errors.

6. Select an output activation function that can cover the range of the output labels. Alternatively: scale the labels up or down to make them match the output activation.
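(A minimal sketch of steps 3-5, assuming numpy and placeholder price data; all sizes are arbitrary:)

```python
import numpy as np

# Placeholder raw close prices (random walk, illustration only)
rng = np.random.default_rng(0)
prices = 100.0 * np.exp(np.cumsum(0.01 * rng.standard_normal(1000)))

# Step 3: stationary transformation, e.g. log returns instead of raw prices
log_returns = np.diff(np.log(prices))

# Step 4: normalization to mean 0 / standard deviation 1 (z-score)
features = (log_returns - log_returns.mean()) / log_returns.std()

# Step 5: weight initialization matched to the activation; here Kaiming He
# (for ReLU-like activations): normal distribution with std = sqrt(2 / fan_in)
fan_in, fan_out = 64, 32
weights = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
```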

___

@Bayne: neural networks reveal the best information if they know the complete picture; with gradient descent, adding an arbitrary bias that's also derived from the same complete picture is like giving the metaphorical ball a kick from the side instead of letting it just roll straight downhill along the path of least resistance. If this leads to faster convergence, it's by luck. And remember: the error is measured against the labels, not the inputs, so the ideal global error minimum remains the same.

The Hidden Markov Model is one of many options for deriving a trading decision from several output variables like momentum, trend strength and trend direction. In short: a Hidden Markov Model calculates the probability of an unobserved but dependent variable. Please google it for details. Disclaimer: I only understand the theoretical concept but haven't used it in my code; it was just an example. As you know, I have worked with Q-learning and genetic decision making, which in my opinion are also good options.

 

Apparently the forum cannot deal with links using the ftp protocol.

There are a lot of misconceptions floating around regarding NNs, mainly due to those clickbait, hype-type blogs doing nothing more than echoing each other's content (including the mistakes). Not sure if the link you provided is one of them, but I do not consume information from sources with an ulterior motive. MLM sells courses, and the articles seem to be written for the purpose of attracting customers.

The FAQ I posted addresses many of the misconceptions and pitfalls, with clear explanations as to why, possible solutions, and good references to further literature, with no ulterior motive.

 
Chris70:

The Hidden Markov Model is one of many options for deriving a trading decision from several output variables like momentum, trend strength and trend direction. In short: a Hidden Markov Model calculates the probability of an unobserved but dependent variable. Please google it for details. Disclaimer: I only understand the theoretical concept but haven't used it in my code; it was just an example. As you know, I have worked with Q-learning and genetic decision making, which in my opinion are also good options.

I know HMMs; my question was more about their integration into trading, more specifically what the unobserved variable would be in this case.

 
Bayne:

I know HMMs; my question was more about their integration into trading, more specifically what the unobserved variable would be in this case.

I guess that's up to the imagination of whoever designs the model; the most obvious choice would probably be as simple as: hidden ("unobserved") state 1 = "price moves up" vs. hidden state 2 = "price moves down" (within a defined time interval, the next renko brick, the next range bar...), translating into the trading action buy or sell.
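Purely as an illustration (again, I haven't used this in my own code), a minimal sketch assuming the hmmlearn library and placeholder observations:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Placeholder observed variables, e.g. the networks' momentum and
# trend strength outputs per bar
rng = np.random.default_rng(0)
obs = rng.standard_normal((500, 2))

# Two hidden states: "price moves up" vs. "price moves down"
hmm = GaussianHMM(n_components=2, covariance_type="full", n_iter=100)
hmm.fit(obs)

states = hmm.predict(obs)          # most likely hidden state per bar
p_states = hmm.predict_proba(obs)  # state probabilities per bar
# trading action: buy while in the "up" state, sell while in the "down" state
```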