Machine learning in trading: theory, models, practice and algo-trading - page 3272

 

Has anyone ever tried to work with "outliers"? I don't mean outliers in the sense of data errors, but rare events.

Funnily enough, it turned out that in some sample rows an outlier is recorded in more than 50% of the predictors....

Do these outliers get into the models? Quite readily, as it turns out.

So it seems this is critical not only for neural networks....

 
fxsaber #:

I don't do comparisons; I provide code that everyone can use to measure their own case.

The string length of 100 is the pattern length. You probably don't need more than that.

15,000 samples is a memory limit because of the quadratic size of the correlation matrix. The more samples, the better - that's why I wrote a home-made program that can handle a million of them.

I have neither the desire nor the time to spend on an objective comparison. I made it for my own tasks and shared the working code. Whoever needs it will see it.

One way to speed things up: store the output matrix in uchar. As a rule, we change percentages in 1% steps and look at something, and a one-byte type covers roughly ±128, enough for whole-percent values. You can adapt both your own implementation and ALGLIB for uchar - the code is available. In total, the matrices can then be 8 times larger in the same memory.
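The byte-storage idea above can be sketched in NumPy (this is an illustration of the trick, not the author's MQL5 code): a correlation expressed in whole percent fits into a signed one-byte integer, cutting memory roughly 8x versus float64.

```python
import numpy as np

# Sketch: quantise a correlation matrix to whole percents stored in int8.
rng = np.random.default_rng(0)
data = rng.standard_normal((100, 500))   # 500 series of pattern length 100

corr = np.corrcoef(data)                           # float64, values in [-1, 1]
corr_pct = np.round(corr * 100).astype(np.int8)    # -100..100 fits in one byte

ratio = corr.nbytes / corr_pct.nbytes              # 8x memory saving
max_err = np.max(np.abs(corr - corr_pct / 100))    # rounding error <= 0.005
```

The price is a fixed 0.5%-of-correlation rounding error, which is irrelevant when thresholds are scanned in 1% steps anyway.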

 
Aleksey Vyazmikin #:

Has anyone ever tried to work with "outliers"? I don't mean outliers in the sense of data errors, but rare events.

Funnily enough, it turned out that in some sample rows an outlier is recorded in more than 50% of the predictors....

Do these outliers get into the models? Quite readily, as it turns out.

So it seems this is critical not only for neural networks....

White/black swans... fxsaber wrote about them in his blog. Among the many variants I have had one like this: it trades for about a week every couple of years. The rest of the time the expert just sits and waits for a move. But that is an idealised performance in the tester. In real trading, slippage (which is huge during such moves) can ruin everything.

 
Forester #:

White/black swans... fxsaber wrote about them in his blog. Among the many variants I have had one like this: it trades for about a week every couple of years. The rest of the time the expert just sits and waits for a move. But that is an idealised performance in the tester. In real trading, slippage (which is huge during such moves) can ruin everything.

I tried removing the rows from the sample that contain many outliers.

And the training changed radically.

Previously, the test results were on average positive while the exam results were almost all negative (this is across 100 models). Cleaning out the outliers reversed the picture: the test results deteriorated badly (average profit near zero), while on exam, on the contrary, many models became profitable.

I am not ready to call this a regularity yet; I will try to check it on other samples.

Besides, the question of how best to define an outlier is still open for me. For now I simply take up to 2.5% of the data on each side, as long as the values do not go beyond that limit.
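My reading of the row-removal procedure described above, as a NumPy sketch (the 50% threshold and data are made up for illustration): flag a value as an outlier if it lies in the outer 2.5% of its predictor's distribution, then drop rows where a large share of predictors are flagged.

```python
import numpy as np

# Sketch: remove sample rows in which many predictors hold tail values.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 20))
X[:5] += 10                          # make the first 5 rows extreme everywhere

lo = np.quantile(X, 0.025, axis=0)   # per-predictor lower 2.5% bound
hi = np.quantile(X, 0.975, axis=0)   # per-predictor upper 2.5% bound
is_outlier = (X < lo) | (X > hi)     # boolean matrix, same shape as X

# drop rows where more than half the predictors are flagged
share = is_outlier.mean(axis=1)
X_clean = X[share <= 0.5]
```

The per-row share makes the "outlier in more than 50% of predictors" observation from earlier in the thread directly measurable.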

 
In general, the feeling is that there is a cycle of a couple of years after training, after which the models suddenly start to find some patterns in new data.
 
Aleksey Vyazmikin #:

I tried removing the rows from the sample that contain many outliers.

And the training changed radically.

Previously, the test results were on average positive while the exam results were almost all negative (this is across 100 models). Cleaning out the outliers reversed the picture: the test results deteriorated badly (average profit near zero), while on exam, on the contrary, many models became profitable.

I am not ready to call this a regularity yet; I will try to check it on other samples.

Besides, the question of how best to define an outlier is still open for me. For now I simply take up to 2.5% of the data on each side, as long as the values do not go beyond that limit.

If you simply take 2.5% (or some other fixed share), the number of removed values depends on the distribution, and that's not right.

It is better to take quantiles, say over 100 bins, and not delete the values below the 1% quantile and above the 99% quantile, but replace them with the 1% and 99% quantile values themselves. You must not delete anything.
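The replacement approach suggested here is usually called winsorising; a minimal NumPy sketch of it (the lognormal sample is just an example of a skewed, heavy-tailed distribution):

```python
import numpy as np

# Sketch: clamp tail values to the 1% / 99% quantiles instead of deleting.
rng = np.random.default_rng(2)
x = rng.lognormal(size=10_000)       # skewed, heavy right tail

q01, q99 = np.quantile(x, [0.01, 0.99])
x_wins = np.clip(x, q01, q99)        # same length, tails flattened
```

Unlike deletion, the sample size and row alignment with the target are preserved, which matters when rows correspond to time-ordered trades.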

 
fxsaber #:
NumPy seems to use a different algorithm than ALGLIB, since the performance differs greatly. Apparently somewhere in the huge Python community there was a very strong algorithmist who devoted a decent amount of time to this problem.

The source code is open, you can take a look. Next to the correlation function there is a [source] link on the right; clicking it takes you to the code. We are interested in lines 2885-2907. Line 2889 uses the covariance; clicking on cov brings up all mentions of cov in the code on the right, and clicking the line with def cov... jumps to the covariance function, and so on. MQL is a C-like language, and all C-like languages are ~90% similar, so you can read C#, Java, Python and JavaScript without much trouble.
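The relationship the post points at can be shown in a few lines: np.corrcoef is essentially the covariance matrix normalised by the outer product of the standard deviations (a simplified mirror of the NumPy source, not a verbatim copy):

```python
import numpy as np

def corrcoef_via_cov(x):
    # Correlation as normalised covariance, as in the NumPy source.
    c = np.cov(x)                # covariance matrix of the rows of x
    d = np.sqrt(np.diag(c))      # per-row standard deviations
    return c / np.outer(d, d)    # divide each c[i, j] by d[i] * d[j]

rng = np.random.default_rng(3)
x = rng.standard_normal((5, 200))
assert np.allclose(corrcoef_via_cov(x), np.corrcoef(x))
```

So any speed difference versus ALGLIB comes mostly from how the covariance itself is computed (NumPy delegates the heavy lifting to BLAS matrix products).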

 
СанСаныч Фоменко #:

If you simply take 2.5% (or some other fixed share), the number of removed values depends on the distribution, and that's not right.

It is better to take quantiles, say over 100 bins, and not delete the values below the 1% quantile and above the 99% quantile, but replace them with the 1% and 99% quantile values themselves. You must not delete anything.

I take a percentage of the data, not a percentage of the range. So where the data is dense, the cut-off point is reached quickly on the range scale.

Cutting off by the mean and variance has proved to be of little use.

I have just been exploring the question of replacing outliers with other values (though so far only for quantisation); I liked the option of replacing them with random values drawn from the remaining subset, weighted by probability.
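One way to read the random-replacement idea (a hypothetical sketch, not the author's code): draw replacements uniformly from the remaining inlier values, so the empirical inlier distribution itself supplies the probabilities, instead of piling all the replaced mass onto one point.

```python
import numpy as np

# Sketch: replace outliers with random draws from the inlier subset.
rng = np.random.default_rng(4)
x = rng.standard_normal(10_000)
x[:50] = 25.0                            # inject obvious outliers

lo, hi = np.quantile(x, [0.025, 0.975])
mask = (x < lo) | (x > hi)               # outlier positions
inliers = x[~mask]

x_fixed = x.copy()
# uniform draws from the empirical inlier values = probability-weighted
x_fixed[mask] = rng.choice(inliers, size=mask.sum())
```

Compared with clamping to the quantile bounds, this avoids creating artificial spikes at the 2.5%/97.5% values in the cleaned distribution.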

 
Aleksey Vyazmikin #:

I take a percentage of the data, not a percentage of the range. So where the data is dense, the cut-off point is reached quickly on the range scale.

Cutting off by the mean and variance has proved to be of little use.

I have just been exploring the question of replacing outliers with other values (though so far only for quantisation); I liked the option of replacing them with random values drawn from the remaining subset, weighted by probability.

Quantiles are probabilities. So we remove/replace data whose probability of falling into the range is less than 1% / more than 99%, or other values. We cannot cut off by value - our distributions are skewed and heavy-tailed.

They write that the best replacement value is a prediction of that value by an ML model. But that seems like overkill to me.

 
СанСаныч Фоменко #:

Quantiles are probabilities. So we remove/replace data whose probability of falling into the range is less than 1% / more than 99%, or other values. We cannot cut off by value - our distributions are skewed and heavy-tailed.

They write that the best replacement value is a prediction of that value by an ML model. But that seems like overkill to me.

The point is exactly that it is often difficult to determine the distribution automatically.

Often it will look lognormal, but only because of the outliers - there is no logic for it actually being so.

And if you take a quantile, that means cutting across the whole range, which will not be enough to remove the outliers.


On the second sample I got a very strange result: without any manipulations it learned quite briskly, but after removing the rows with outliers the learning effect became almost zero.

Now I have switched to a slow learning rate - I will leave it running overnight and see whether it gives anything.

Otherwise it turns out that the whole learning process is built on memorising outliers, at least with the public predictors I use in the experiment.
