Machine learning in trading: theory, models, practice and algo-trading - page 2112

 
Aleksey Vyazmikin:

On the Y axis is the grid partitioning (the quantization bins), and on the X axis the deviation, as a percentage of the total of each class's target in the whole sample. The filter is 5%. You can see that different classes dominate in different areas, and the change mirrors itself: sometimes the improvement comes at the expense of a particular class (the histogram goes into the negative), and sometimes it does not. All of this ought to be used in training, but the standard training methods known to me do not really take it into account. It may be more effective to search through this with genetics (more precisely, by elimination) - that will have to be done.
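A rough sketch of one possible reading of the statistic described in the quote above (my own illustration, not Aleksey's code): for a feature already split into quantization bins and a binary target, compute how much each bin's class-1 share deviates from the sample-wide share and keep only bins passing a 5% filter.

import numpy as np
import pandas as pd

# Hypothetical inputs: a quantization-bin index per row and a binary class label per row
rng = np.random.default_rng(0)
df = pd.DataFrame({"bin": rng.integers(0, 20, size=1000),
                   "y": rng.integers(0, 2, size=1000)})

overall_share = df["y"].mean()  # share of class 1 in the whole sample

# Deviation of each bin's class-1 share from the sample-wide share, in percent
deviation_pct = (df.groupby("bin")["y"].mean() - overall_share) * 100

# Apply the 5% filter: keep only bins that deviate noticeably in either direction
print(deviation_pct[deviation_pct.abs() >= 5])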

Let's say you find a good quantum (quantization bin) in which 65% of the examples are class 1.
The split happens, say, in the middle, and let the tree split on this quantum of yours.

1) From that split, the left branch receives all the examples from your quantum with its 65% of the needed examples, plus the examples from the whole bunch of other quanta whose values are below yours. As a result we get not 65% but a different, much smaller percentage because of dilution with examples from the other quanta (a small numeric illustration follows below).

2) Second, if your quantum is not the first split in the tree, then each previous split removed roughly 50% of the examples from the sample. At level 5 of the tree only 1/(2^5) = 1/32 of the examples in your quantum remain, mixed with quanta thinned out in the same way as in the first case. So it is unlikely that 65% of the examples will remain in the leaf after training.

The only option is to mark the columns as categorical features after quantization - then, if it is the first split in the tree, all 65% of the examples will go to the left branch without mixing with other quanta. If it is not the first split in the tree, you again get thinning from the splits above.
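A small numeric illustration of the dilution in point 1 (my own example, with made-up counts): the left branch mixes the 65% bin with every lower-valued bin, so its class share lands well below 65%.

# Hypothetical counts: the "good" quantum vs. all lower-valued quanta that also go left
bin_count, bin_share = 200, 0.65        # 200 examples, 65% class 1
lower_count, lower_share = 800, 0.45    # 800 examples, 45% class 1

left_share = (bin_count * bin_share + lower_count * lower_share) / (bin_count + lower_count)
print(round(left_share, 2))             # 0.49 - far below the 65% seen inside the bin itself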

 
elibrarius:

Suppose you found a good quantum (quantization bin) in which 65% of the examples are class 1.
The split happens, say, in the middle, and let the tree split on this quantum of yours.

1) From that split, the left branch receives all the examples from your quantum with its 65% of the needed examples, plus the examples from the whole bunch of other quanta whose values are below yours. As a result we get not 65% but a different, much smaller percentage because of dilution with examples from the other quanta.

2) Second, if your quantum is not the first split in the tree, then each previous split removed roughly 50% of the examples from the sample. At level 5 of the tree only 1/(2^5) = 1/32 of the examples in your quantum remain, mixed with quanta thinned out in the same way as in the first case. So it is unlikely that 65% of the examples will remain in the leaf after training.

The only option is to mark the columns as categorical features after quantization - then, if it is the first split in the tree, all 65% of the examples will go to the left branch without mixing with other quanta. If it is not the first split in the tree, you again get thinning from the splits above.

1-2: yes, it can be like that, but not necessarily; an approach is needed that will minimize that probability.

As for categorical features - that is true, but for MQ there is no interpreter for a model with categorical features.

So far I see the solution in merging the quantum segments under one value and creating a separate sample containing only the rows where those values occur - that way we are guaranteed to work with that subset. I will do this in the leaf search, but first I need a way to quantize quickly with different methods.
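A minimal sketch of what such a consolidation could look like (my own illustration, assuming a pandas DataFrame with an already-quantized column; the column names are hypothetical): the selected quantum segments are merged under one value and the rows that fall into them are pulled out as a separate sample.

import pandas as pd

# Hypothetical quantized feature plus target
df = pd.DataFrame({"f1_bin": [0, 3, 7, 3, 7, 1, 3],
                   "target": [0, 1, 1, 1, 0, 0, 1]})

selected = {3, 7}                                   # quantum segments to consolidate
df["f1_merged"] = df["f1_bin"].where(~df["f1_bin"].isin(selected), other=-1)

# Separate sample that is guaranteed to contain only rows from those segments
subset = df[df["f1_bin"].isin(selected)]
print(subset)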

 
Aleksey Vyazmikin:

The results are strange: on the test and training samples Recall is 0.6-0.8, while on the exam sample it is 0.009 without the transformation and 0.65 with it - something is wrong here :(

It seems that CatBoost has learned the transformation algorithm :)

And is there a way to mark the old and new rows? Then the transformed rows could be removed from the transformed sample to see whether it is a problem of interpretation or simply poor-quality training after all.

That is how it should be - there are fewer examples of that class in the new data. The generalization ability here should supposedly be better; it needs to be run through the tester right away.

Resampling doesn't help with my data.

The new rows are added at the end; if you subtract them, the original dataset remains. This method simply adds examples to the minority class using the nearest-neighbour method, i.e. it creates plausible new labels and features.
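Maxim does not name the method, but the description (synthetic minority-class rows built from nearest neighbours and appended after the original data) matches SMOTE-style over-sampling; a minimal sketch with imbalanced-learn, using a toy dataset in place of the real one:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced dataset standing in for the real features and labels
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

sm = SMOTE(random_state=0)
X_resampled, y_resampled = sm.fit_resample(X, y)

# Synthetic minority rows are appended after the originals, so the class counts even out
print(Counter(y), Counter(y_resampled))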

 
Aleksey Vyazmikin:

1-2: yes, it can be like that, but not necessarily; an approach is needed that will minimize that probability.

As for categorical features - that is true, but for MQ there is no interpreter for a model with categorical features.

So far I see the solution in merging the quantum segments under one value and creating a separate sample containing only the rows where those values occur - that way we are guaranteed to work with that subset. I will do this in the leaf search, but first I need a way to quantize quickly with different methods.

This is the tree-building algorithm. You can't change it, unless you write your own CatBoost.

 
Maxim Dmitrievsky:

That is how it should be - there are fewer examples of that class in the new data. The generalization ability here should supposedly be better; it needs to be run through the tester right away.

Resampling doesn't help with my data.

The new rows are added at the end; if you subtract them, the original dataset remains. This method simply adds examples to the minority class using the nearest-neighbour method, i.e. it creates plausible new labels and features.

So Recall has to stay high, otherwise there is no point. It does not depend on the sample balance.

I understand the principle, thank you.

There is some clustering-based method, "Cluster Centroids" - or what else from here is worth trying?

5 main sampling algorithms
  • habr.com
Working with data means working with data-processing algorithms, and I have had to deal with a wide variety of them on a daily basis, so I decided to compile a list of the most popular ones in a series of publications. This article is devoted to the most common sampling methods when working with data.
 
elibrarius:

This is the tree-building algorithm. You can't change it, unless you write your own CatBoost.

That's what I'm talking about - you have to make your own algorithm.

 
Aleksey Vyazmikin:

So Recall must remain high, otherwise there is no point. It does not depend on the sample balance.

I understand the principle, thank you.

There is some clustering-based method, "Cluster Centroids" - or what else from here is worth trying?

That one, on the contrary, removes examples of the majority class.

 
Maxim Dmitrievsky:

That one, on the contrary, removes examples of the majority class.

So let's remove zeros in a clever way, maybe it will have an effect.

 
Aleksey Vyazmikin:

So delete the zeros in a clever way, maybe it will have an effect.

In the notebook, just replace the method and that's it.

from imblearn.under_sampling import ClusterCentroids

# X is the feature matrix and y the class labels - both must already be defined in the notebook
cc = ClusterCentroids(random_state=0)
X_resampled, y_resampled = cc.fit_resample(X, y)

Here is an example

https://imbalanced-learn.readthedocs.io/en/stable/under_sampling.html

I prefer Near-Miss (judging by the pictures).

3. Under-sampling — imbalanced-learn 0.5.0 documentation
  • imbalanced-learn.readthedocs.io
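Since Near-Miss is mentioned as the preferred option, here is a correspondingly minimal sketch with imbalanced-learn's NearMiss under-sampler (my addition, again with a toy dataset standing in for the real one):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

# Toy imbalanced dataset standing in for the real one
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Version 1 keeps the majority samples whose average distance to the minority class is smallest
nm = NearMiss(version=1)
X_resampled, y_resampled = nm.fit_resample(X, y)
print(Counter(y_resampled))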
 
Maxim Dmitrievsky:

In the notebook, just replace the method and that's it.

I must have changed it in the wrong place.

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-2-e8cb95eddaea> in <module>()
      1 cc = ClusterCentroids(random_state=0)
----> 2 X_resampled, y_resampled = cc.fit_resample(X, y)

NameError: name 'X' is not defined

Please check what's wrong.
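The traceback only says that X does not exist in that notebook cell; a hedged guess at the fix (my illustration, with a stand-in dataset, since the original X and y are not shown) is to define the feature matrix and labels before calling fit_resample:

from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# Stand-in for the real features and labels, which must be defined in the same notebook session
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

cc = ClusterCentroids(random_state=0)
X_resampled, y_resampled = cc.fit_resample(X, y)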
