Machine learning in trading: theory, models, practice and algo-trading - page 25

 
SanSanych Fomenko:
"object" in R is much more complicated than in many programming languages.
Objects are not different, just functions str, plot, summary and the like are many times overloaded, each type has its own implementation
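A small illustration of that overloading (the objects below are made up just for the example; cars is a stock R dataset):

x_num <- rnorm(100)                      # a plain numeric vector
x_fit <- lm(dist ~ speed, data = cars)   # a fitted model of class "lm"

summary(x_num)    # dispatches to summary.default: min, quartiles, mean, max
summary(x_fit)    # dispatches to summary.lm: coefficients, R-squared, etc.
methods(summary)  # lists the many class-specific implementations of summary()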
 
Alexey Burnakov:
I did not quite understand why it takes so long. What is your optim_var_number? It should be within 10. Set the time limit to 1200 seconds and you should already get something.

I have a lot of predictors (9602), that's why it takes so long. They are taken from EURUSD D1 for 2015: all sorts of prices, times and indicators. I haven't gone beyond D1 yet, so the number of training examples is only 250+, by the number of trading days in a year. No gaps.

optim_var_number = 0.8662363

Files:
trainData.zip  14378 kb
 
Dr.Trader:

I have a lot of predictors (9602), that's why it takes so long. They are taken from EURUSD D1 for 2015: all sorts of prices, times and indicators. I haven't gone beyond D1 yet, so the number of training examples is only 250+, by the number of trading days in a year. No gaps.

optim_var_number = 0.8662363

I'll play around with your set. Everything should fly.
 
SanSanych Fomenko:

The last row (Cumulative Proportion) says that if you take only PC1, it explains 0.9761 of the variability; if you take TWO components, PC1 and PC2, they explain 0.99996, and so on.

                          PC1     PC2     PC3      PC4      PC5
Standard deviation     2.2092 0.34555 0.01057 0.008382 0.004236
Proportion of Variance 0.9761 0.02388 0.00002 0.000010 0.000000
Cumulative Proportion  0.9761 0.99996 0.99998 1.000000 1.000000
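For reference, this is what summary() prints for a prcomp object; a minimal sketch on a stock dataset (USArrests is used only for illustration):

pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)
summary(pca)                                        # Standard deviation / Proportion of Variance / Cumulative Proportion, one column per PC
summary(pca)$importance["Cumulative Proportion", ]  # the same last row as a named numeric vector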

It turned out to be a bit of a mess: this approach only works if you work with all possible components. The prcomp function has a parameter "tol", which is NULL by default, but you can assign it a value between 0 and 1 to reduce the number of components found. It works like this: when searching for a new component, the function takes the sdev of the first component and multiplies it by tol; as soon as the sdev of a new component falls below that product, no more components are generated. For example, in your case with tol = 0.1 all components with sdev < 0.22 are discarded, so only two principal components remain; with tol = 0.003 only components with sdev > 0.0066276 remain, i.e. four. With tol = NULL (the default) the function generates the maximum number of components, but that takes too long, which is why I want to shorten the process.

With tol everything runs faster and there are fewer components, but then the Cumulative Proportion breaks: it is apparently calculated from the found components only, so the Cumulative Proportion of the last found component is always 1. Even if only 2 components are found instead of thousands, the cumulative proportion of the second one becomes 1 (instead of, say, 0.1 if all components were generated), and the value for PC1 grows accordingly. The Cumulative Proportion probably also changes incorrectly when predictors are sifted out.

So the Cumulative Proportion cannot be trusted; if you seriously work with y-aware PCA, you should write your own function to calculate the explained variability.
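A minimal sketch of both points, the tol truncation and a do-it-yourself explained-variance calculation (the matrix x below is a made-up stand-in with two strong common factors, not your data):

set.seed(1)
f <- matrix(rnorm(250 * 2), nrow = 250)                                   # two hidden factors
x <- f %*% matrix(rnorm(2 * 20), nrow = 2) + 0.1 * matrix(rnorm(250 * 20), nrow = 250)

# tol: components whose sdev falls below tol * sdev of PC1 are omitted
pca_trunc <- prcomp(x, center = TRUE, scale. = TRUE, tol = 0.1)
length(pca_trunc$sdev)            # far fewer than 20 components are kept

# summary() computes the proportions from the retained components only,
# so the last retained component always shows Cumulative Proportion = 1
summary(pca_trunc)

# A more honest calculation: compare against the total variance of the scaled data
x_scaled  <- scale(x)
total_var <- sum(apply(x_scaled, 2, var))
true_prop <- pca_trunc$sdev^2 / total_var
cumsum(true_prop)                 # true cumulative share of variance explained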

 
Dr.Trader:

It turned out to be a bit of a mess: this approach only works if you work with all possible components. The prcomp function has a parameter "tol", which is NULL by default, but you can assign it a value between 0 and 1 to reduce the number of components found. It works like this: when searching for a new component, the function takes the sdev of the first component and multiplies it by tol; as soon as the sdev of a new component falls below that product, no more components are generated. For example, in your case with tol = 0.1 all components with sdev < 0.22 are discarded, so only two principal components remain; with tol = 0.003 only components with sdev > 0.0066276 remain, i.e. four. With tol = NULL (the default) the function generates the maximum number of components, but that takes too long, which is why I want to shorten the process.

With tol everything runs faster and there are fewer components, but then the Cumulative Proportion breaks: it is apparently calculated from the found components only, so the Cumulative Proportion of the last found component is always 1. Even if only 2 components are found instead of thousands, the cumulative proportion of the second one becomes 1 (instead of, say, 0.1 if all components were generated), and the value for PC1 grows accordingly. The Cumulative Proportion probably also changes incorrectly when predictors are sifted out.

So the Cumulative Proportion cannot be trusted; if you seriously work with y-aware PCA, you should write your own function to calculate the explained variability.

An interesting thought; it was not for nothing that I urged you to take a look.
 
Dr.Trader:

I have a lot of predictors (9602), that's why it takes so long. They are taken from EURUSD D1 for 2015: all sorts of prices, times and indicators. I haven't gone beyond D1 yet, so the number of training examples is only 250+, by the number of trading days in a year. No gaps.

optim_var_number = 0.8662363

I looked at your set. Either I don't understand something (for example, not all the variables came through), or you are making a big mistake. You have a lot of raw price values, for example 1.1354 (MAs and the like). You cannot do that: this is completely non-stationary data. All the data should be differences or oscillator-type indicators, and everything should be stationary. Looking for dependencies in such data is a completely pointless job.
 

Right, I forgot: you already said that the data had to be specially prepared, and I took raw data. There are oscillators among the indicators too; I will try taking only them.

By the way, the PCA model works even with such data, but it needs a lot of centering, scaling and rotations of the raw data. For a neural network it is easier: it only needs the data normalized into [0...1].
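For reference, a minimal sketch of what normalization into [0...1] means here (a column-wise min-max rescaling; the matrix x is a stand-in, and whether this is enough is debated just below):

x <- matrix(rnorm(250 * 5), nrow = 250)                                   # stand-in predictor matrix
x_norm <- apply(x, 2, function(col) (col - min(col)) / (max(col) - min(col)))
range(x_norm)                                                             # 0 1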

 
Dr.Trader:

Right, I forgot: you already said that the data had to be specially prepared, and I took raw data. There are oscillators among the indicators too; I will try taking only them.

By the way, the PCA model works even with such data, but it needs a lot of centering, scaling and rotations of the raw data. For a neural network it is easier: it only needs the data normalized into [0...1].

No, you definitely don't quite understand the importance of non-stationarity. It doesn't matter whether it is a neural network, a linear model or my model: if your data is non-stationary, the dependencies found on it are guaranteed not to hold outside the sample. Everything in your data of the form raw price, MA(raw price), bar open (raw price), etc. should be removed from the model. You need to take their difference from the last known price.

And simply scaling into an interval cannot fix that.
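A minimal sketch of that kind of preparation (the price vector is made up purely for illustration):

price <- c(1.1354, 1.1360, 1.1347, 1.1352, 1.1349, 1.1361)   # made-up close prices

d_price <- diff(price)        # change from the previous bar instead of the raw price
log_ret <- diff(log(price))   # log returns, another common choice

# An MA-based predictor expressed relative to the current price rather than as raw MA values
ma3      <- stats::filter(price, rep(1/3, 3), sides = 1)     # trailing 3-bar simple moving average
ma3_dist <- price - ma3                                      # distance of the price from its MA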

 
Alexey Burnakov:

If your data is non-stationary, the dependencies found on it are guaranteed not to hold outside the sample.

Only there is one interesting nuance that raises doubts about the adequacy of this representation of the data (differences from the previous value):

1) For example, we have a price;

2) we take its differences;

3) we find two sections of the differences that are very close to each other in structure (say, even by Euclidean distance);

4) these sections almost certainly fall into the same cluster in an RF, or into the same neuron, and are treated as identical situations;

5) we take these two sections (of differences) and restore them to prices, i.e. cumulate them.

And we see that the two resulting segments are completely different: often one trends upwards and the other downwards, i.e. there is no resemblance, yet the algorithm thinks they are identical segments (a small sketch of this experiment is given below).

What do you think about this, Alexey? Your comment as an experienced person would be interesting.
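A minimal sketch of the experiment from points 1-5 above (the difference segments are fabricated for illustration; whether such segments should count as "close" is exactly what is being debated):

set.seed(42)
n  <- 100
d1 <- 0.001 * rnorm(n)     # a made-up segment of bar-to-bar differences
d2 <- d1 - 0.0006          # a second segment: pointwise very close to the first

sqrt(sum((d1 - d2)^2))     # step 3: Euclidean distance between the two difference segments

p0 <- 1.1350               # cumulate both from the same last known price
p1 <- p0 + cumsum(d1)      # step 5: restore the price paths
p2 <- p0 + cumsum(d2)

tail(p1, 1) - tail(p2, 1)               # the tiny per-bar offset accumulates to n * 0.0006 = 0.06
plot(p1, type = "l"); lines(p2, lty = 2)   # one restored path drifts well below the other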

 
mytarmailS:

Only there is one interesting nuance that raises doubts about the adequacy of this representation of the data (differences from the previous value):

1) For example, we have a price;

2) we take its differences;

3) we find two sections of the differences that are very close to each other in structure (say, even by Euclidean distance);

4) these sections almost certainly fall into the same cluster in an RF, or into the same neuron, and are treated as identical situations;

5) we take these two sections (of differences) and restore them to prices, i.e. cumulate them.

And we see that the two resulting segments are completely different: often one trends upwards and the other downwards, i.e. there is no resemblance, yet the algorithm thinks they are identical segments.

What do you think about it, Alexey?

Why would it be like that? If the differences are the same, the cumulated (integrated) series will coincide completely. If they are merely similar, the cumulated series will be similar in trend.

What I wanted to tell Dr.Trader is: read up on the basics of data preparation. No one feeds raw prices into a model's input; this has been chewed over a hundred thousand times. Non-stationary data leads to non-stationary dependencies.