Machine learning in trading: theory, models, practice and algo-trading - page 551

 

I'd like to add my opinion. Over the last few days I have also been thinking about reducing the number of input variables to speed up optimization, and I must say I have had some success sifting out unnecessary inputs.

Question: which input variables should be removed from the training sample because they are noise?

Actually, this question is far from trivial. If you knew which inputs were noise and which were not, building a model would be easy. But when all the inputs are related to the output in one way or another, which ones do you remove?

For me the answer turned out to be very simple: keep only those inputs whose distribution looks roughly normal, i.e. the histogram is bell-shaped and its central part sits in the middle of the range. These are the kinds of variables that can be useful for training. I'm not saying these variables necessarily carry alpha for the output; they may not. But the search itself will be more thorough, and on such variables the algorithm has a better chance of latching on, which lets you increase the number of inputs. Here is an example:

This one counts as a good input: it has a roughly normal distribution and the central part of the histogram sits in the middle of the range.

This data, on the other hand, has a skewed distribution with outliers beyond the main histogram. The histogram shows that the data leans to one side, which is unlikely to be useful for building the model.

At the initial stage of selecting the input data we cannot judge the significance of any particular input for the output; that is the optimizer's job. At this stage we can only judge the distribution of a variable relative to zero. If that distribution is roughly normal (the data lie evenly on both sides of zero), the optimizer will most likely have more to work with than with data skewed relative to zero, where most of the values sit in the negative zone or vice versa.
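A minimal sketch of this kind of screening, assuming the features are columns of a NumPy array (my own illustration, not the exact procedure described above): keep only columns that are roughly symmetric, not too heavy-tailed, and whose bulk sits near zero.

```python
import numpy as np
from scipy import stats

def screen_features(X, skew_limit=0.5, kurt_limit=3.0):
    """X: 2-D array (samples x features). Returns indices of 'well-behaved' columns."""
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        skew = stats.skew(col)            # asymmetry of the histogram
        kurt = stats.kurtosis(col)        # excess kurtosis: heavy tails / outliers
        centered = abs(np.median(col)) < np.std(col)  # bulk of values near zero
        if abs(skew) < skew_limit and kurt < kurt_limit and centered:
            keep.append(j)
    return keep

# toy check: column 0 is symmetric around zero, column 1 is strongly skewed
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 1000), rng.lognormal(0, 1, 1000)])
print(screen_features(X))   # expected: [0]
```

The thresholds here are arbitrary; the point is only that a symmetric, zero-centered column passes and a skewed one with a long tail does not.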

So, it's like this....

 

It depends on what model you are choosing them for :) If, after removing an uninformative feature, the model does not lose much accuracy, then to hell with it. Delete it, retrain, and see whether anything else unnecessary turns up.

But if you have a regression model with a non-stationary process at the output, this approach will ruin everything, because the model will overfit to normally distributed noise.
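The "delete it, retrain, and compare" step could look roughly like this sketch (my own, with toy data and a RandomForestClassifier standing in for whatever model is actually used): a feature whose removal barely changes the cross-validated score is a candidate for dropping.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def drop_column_scores(X, y, make_model, cv=5):
    """Return the baseline CV score and the score change from dropping each column."""
    base = cross_val_score(make_model(), X, y, cv=cv).mean()
    drops = {}
    for j in range(X.shape[1]):
        X_reduced = np.delete(X, j, axis=1)
        drops[j] = base - cross_val_score(make_model(), X_reduced, y, cv=cv).mean()
    return base, drops   # drop near zero (or negative) => feature adds little

# toy data: only feature 0 actually drives the class label
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)
base, drops = drop_column_scores(
    X, y, lambda: RandomForestClassifier(n_estimators=100, random_state=0))
print(base, drops)
```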

 
Maxim Dmitrievsky:

It depends on what model you are choosing them for :) If, after removing an uninformative feature, the model does not lose much accuracy, then to hell with it. Delete it, retrain, and see whether anything else unnecessary turns up.

But if you have a regression model with a non-stationary process at the output, this approach will ruin everything, because the model will overfit to normally distributed noise.


Classification relative to zero. For that purpose this approach is just right, IMHO!

 
Mihail Marchukajtes:

This data, on the other hand, has a skewed distribution with outliers beyond the main histogram. The histogram shows that the data leans to one side, which is unlikely to be useful for building the model.

Vladimir's articles have a point about removing outliers; if the outliers in your figure #2 are removed, you will get a distribution closer to normal.

There is also centering of the input data, which will improve the situation even more.

 
elibrarius:

Vladimir's articles have a point about removing outliers; if the outliers in your figure #2 are removed, you will get a distribution closer to normal.

There is also centering of the input data, which will improve the situation even more.


And what do you do when such an outlier arrives in the new data? How will the model interpret it?

Removing an outlier from the data means removing the entire input vector containing that outlier, and what if that vector holds important data in its other inputs? If an input is prone to such outliers by its nature, it is better not to take it at all. IMHO.

 
Mihail Marchukajtes:

Classification relative to zero. For that purpose this approach is just right, IMHO!


Yes, if the outputs are distributed according to roughly the same law; if not, you get the same overfitting.

 
elibrarius:

Vladimir's articles have a point about removing outliers; if the outliers in your figure #2 are removed, you will get a distribution closer to normal.

There is also centering of the input data, which will improve the situation even more.


Removing outliers is a statistical measure, or rather a crutch (a la the desire to make everything stationary), which can significantly worsen predictions on forex and bring the whole system to naught (it will only work where the market is normally distributed).

You need to understand where the NN is used and for what purpose... not just read books and then do whatever you like :)

Vladimir has no proof of the robustness of his models... only very rough model tests, in the same R.

So I don't know what to believe in this life... everything has to be double-checked.

 
Maxim Dmitrievsky:

Yes, if the outputs are distributed according to roughly the same law; if not, you get the same overfitting.


Well, I always balance the output to an equal number of class "0" and class "1" examples. That is, my output is balanced, and I take inputs with a roughly normal distribution relative to zero. The optimizer has to be run several times, but as a rule the more inputs a model uses, the better its performance, so I choose the more parametric model with the maximum result on the test section.
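A small sketch (my own illustration, not necessarily how the balancing is done here) of equalizing the two classes by downsampling the majority class:

```python
import numpy as np

def balance_classes(X, y, seed=0):
    """Downsample the majority class so classes 0 and 1 occur equally often."""
    rng = np.random.default_rng(seed)
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))                      # size of the smaller class
    keep = np.concatenate([rng.choice(idx0, n, replace=False),
                           rng.choice(idx1, n, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]
```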

 
Mihail Marchukajtes:

And what do you do when such an outlier arrives in the new data? How will the model interpret it?

Removing an outlier from the data means removing the entire input vector containing that outlier, and what if that vector holds important data in its other inputs? If an input is prone to such outliers by its nature, it is better not to take it at all. IMHO.

In the new data the outliers are removed as well, according to the range obtained during training. Say in training you clipped everything outside -100 to +100: remember those levels and clip the new data by the same levels. Do this on the raw values, and then you can normalize. Without removing outliers, the center of my data kept shifting and the data stopped being comparable with itself a week earlier.

Outliers appear only at moments of news or extraordinary events, and each time the strength of these outliers will be different. I decided for myself that it is better to discard them, and so does Vladimir; it is not my own idea, and apparently it is confirmed by the research of many people.
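A sketch of that procedure, under my own assumptions about the data layout (percentile-based clip levels standing in for the fixed -100/+100 example): the levels, center, and scale are learned on the training window and then applied unchanged to new data.

```python
import numpy as np

class OutlierClipper:
    """Clip outliers by levels learned on the training sample, then center and scale."""
    def fit(self, X, lower_q=0.5, upper_q=99.5):
        # remember per-feature clip levels from the training data
        self.lo = np.percentile(X, lower_q, axis=0)
        self.hi = np.percentile(X, upper_q, axis=0)
        clipped = np.clip(X, self.lo, self.hi)
        # remember center and scale for the subsequent normalization
        self.center = clipped.mean(axis=0)
        self.scale = clipped.std(axis=0) + 1e-12
        return self

    def transform(self, X_new):
        # the SAME levels are applied to new data so it stays comparable over time
        clipped = np.clip(X_new, self.lo, self.hi)
        return (clipped - self.center) / self.scale

# usage: fit on the training window, apply to new data with the same levels
clipper = OutlierClipper().fit(np.random.default_rng(2).normal(size=(1000, 3)))
new_scaled = clipper.transform(np.random.default_rng(3).normal(size=(200, 3)))
```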

 
Mihail Marchukajtes:


Well, I always balance the output to an equal number of class "0" and class "1" examples. That is, my output is balanced, and I take inputs with a roughly normal distribution relative to zero. The optimizer has to be run several times, but as a rule the more inputs a model uses, the better its performance, so I choose the more parametric model with the maximum result on the test section. Further on, I do some boosting and other tricks...


So it's not jPredictor anymore, then? :)