Machine learning in trading: theory, models, practice and algo-trading - page 2112
On the Y axis is the grid partition, and on the X axis the deviation, as a percentage, of the sum of the target for each class from its sum over the whole sample. The filter is 5%. You can see that different classes dominate in different areas, and there is a mirrored change: sometimes the improvement comes at the expense of a particular class (the histogram goes negative), and sometimes it does not. All of this should be used in training, but the standard training methods known to me do not really take it into account. It may be more effective to brute-force it with genetics (more precisely, by elimination); I will have to try it.
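The diagnostic described above can be sketched roughly as follows (all names and the synthetic data are illustrative, not the poster's actual code): for each cell of a quantization grid, compute the deviation of the class-1 share in the cell from its share over the whole sample, then apply the 5% filter.

```python
import numpy as np

# Illustrative sketch: class mix drifts with the feature, so some grid
# cells deviate from the sample-wide class share and pass the 5% filter.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = (rng.random(2000) < 0.5 + 0.2 * np.tanh(x)).astype(int)

edges = np.quantile(x, np.linspace(0, 1, 11))      # 10-cell quantization grid
cells = np.clip(np.digitize(x, edges[1:-1]), 0, 9)  # cell index per example
base = y.mean()                                     # class-1 share, whole sample

# per-cell deviation from the base share, in percentage points
dev = np.array([y[cells == c].mean() - base for c in range(10)]) * 100
significant = np.abs(dev) >= 5                      # the 5% filter
print(np.round(dev, 1))
```

Cells where `significant` is True are the areas where one class dominates, as described above.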
Suppose you find a good quantum (quantized bin) in which 65% of the examples belong to class 1.
The split happens, say, in the middle; let the tree split on this quantum of yours.
1) From that split, the left branch receives all examples from your quantum (with the 65% of needed examples) plus the examples from a bunch of other quanta whose values are smaller. As a result we get not 65% but a much smaller percentage, because of dilution with examples from other quanta.
2) If your quantum is not the first split in the tree, then each previous split removed roughly 50% of the examples from the sample. At level 5 of the tree only 1/(2^5) = 1/32 of the examples in your quantum will be left, mixed with other quanta thinned in the same way as in the first case. So after training it is unlikely that 65% of the examples will remain in the leaf.
The only option is to mark the columns as categorical features after quantization. Then, if it is the first split in the tree, all 65% of the examples go to the left branch without mixing with other quanta. If it is not the first split, you again get thinning from the upper splits.
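The dilution argument above can be checked with trivial arithmetic (an idealized sketch, assuming each unrelated upper split retains about half the examples and that the bin is pooled with other bins of roughly even class mix):

```python
# A "quantum" (quantized bin) starts at 65% class-1 purity; upper splits
# thin it, and pooling with other bins in the same branch dilutes it.
def remaining_fraction(depth: int) -> float:
    """Fraction of the bin's examples surviving `depth` unrelated 50/50 splits."""
    return 1.0 / (2 ** depth)

def diluted_share(bin_share: float, bin_purity: float, other_purity: float) -> float:
    """Class-1 share after the bin (bin_share of the branch, purity bin_purity)
    is pooled with other bins of purity other_purity."""
    return bin_share * bin_purity + (1.0 - bin_share) * other_purity

print(remaining_fraction(5))            # 0.03125, i.e. 1/32
print(diluted_share(0.2, 0.65, 0.5))    # 0.53, well below the original 0.65
```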
1-2: yes, it can happen that way, but not necessarily; you need an approach that minimizes that probability.
About categorical features: that is true, but for MQ there is no interpreter for a model with categorical features.
So far I see a solution in consolidating the quantum segments under one value and creating a separate sample where those values occur; that way we are guaranteed to work with that subset. I will do this in the leaf search, but first I need a way to quantize quickly with different methods.
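A minimal sketch of that consolidation idea, with illustrative names and synthetic data (not the poster's actual pipeline): collapse the chosen quantum segments to one shared value and carve out the sub-sample where they occur.

```python
import numpy as np

# Quantize a feature into 10 bins, then consolidate a few "good" bins
# under one shared value and extract the sub-sample where they occur.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
edges = np.quantile(x, np.linspace(0, 1, 11))        # 10-bin quantization grid
bins = np.clip(np.digitize(x, edges[1:-1]), 0, 9)    # bin index per example

chosen = [3, 4]                                       # "good" quanta to consolidate
mask = np.isin(bins, chosen)
consolidated = np.where(mask, -1, bins)               # one shared value for chosen bins
subset_idx = np.flatnonzero(mask)                     # separate sample with those values
print(subset_idx.size)
```

Training or leaf search can then be restricted to `subset_idx`, guaranteeing work with exactly that subset.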
The results are strange: on the training and test samples Recall is 0.6-0.8, but on the exam sample it is 0.009 without the transformation and 0.65 with it. Something is wrong here :(
It seems that CatBoost has learned the transformation algorithm :)
Is there a way to mark the old and the new rows? Then the transformed rows could be removed from the transformed sample, to see whether it is a problem of interpretation or simply of poor-quality training.
That is how it should be: there are fewer examples of the needed class in the new data. Generalizability should be better here; we should run it in the tester right away.
With my data, resampling does not help.
New rows are added at the end; if you subtract them, the original dataset remains. This method simply adds examples to the minority class by the nearest-neighbour method, i.e. it creates plausible new labels and features.
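A minimal sketch of that point, assuming the resampler appends synthetic rows after the originals (as imblearn's SMOTE does); the nearest-neighbour interpolation here is a simplified stand-in, not imblearn's implementation:

```python
import numpy as np

# Hypothetical minimal SMOTE-like step: interpolate between a minority point
# and its nearest minority neighbour. Synthetic rows are appended at the end,
# so slicing by the original length separates old rows from new ones.
def oversample_minority(X, y, minority=1, n_new=5, seed=0):
    rng = np.random.default_rng(seed)
    Xm = X[y == minority]
    new_rows = []
    for _ in range(n_new):
        i = int(rng.integers(len(Xm)))
        d = np.linalg.norm(Xm - Xm[i], axis=1)
        d[i] = np.inf
        j = int(np.argmin(d))                      # nearest minority neighbour
        lam = rng.random()
        new_rows.append(Xm[i] + lam * (Xm[j] - Xm[i]))
    X_res = np.vstack([X] + new_rows)
    y_res = np.concatenate([y, np.full(n_new, minority)])
    return X_res, y_res

X = np.array([[0.0, 0], [1, 1], [9, 9], [10, 10], [11, 11]])
y = np.array([1, 1, 0, 0, 0])
X_res, y_res = oversample_minority(X, y)
synthetic = X_res[len(X):]                         # exactly the added rows
print(len(synthetic))
```

This is how the transformed rows can be marked and removed again, as asked above.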
So Recall has to stay high, otherwise there is no point. It does not depend on the balance of the sample.
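That claim is easy to verify: recall for class 1 is TP / (TP + FN), computed only over the class-1 rows, so the number of class-0 rows in the sample does not enter. A toy check:

```python
# Recall only looks at rows whose true label is the positive class,
# so removing negative rows from the evaluation set leaves it unchanged.
def recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
print(recall(y_true, y_pred))            # 0.75
print(recall(y_true[:6], y_pred[:6]))    # still 0.75 with most zeros dropped
```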
I understand the principle, thank you.
There is some method with clustering, "Cluster Centroids" - or what else should I try from here?
This is the tree-building algorithm; you can't change it, unless you write your own CatBoost.
That's what I'm talking about: you have to make your own algorithm.
This one, on the contrary, removes labels from the majority class.
So let's remove the zeros in a clever way; maybe it will have an effect.
In the notebook, just replace the method and that's it.
from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=0)
X_resampled, y_resampled = cc.fit_resample(X, y)
Here is an example
https://imbalanced-learn.readthedocs.io/en/stable/under_sampling.html
I prefer Near-Miss (judging by the pictures).
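For reference, the NearMiss-1 idea can be sketched in plain NumPy (a simplified illustration, not imblearn's implementation): keep the majority samples that are closest, on average, to their k nearest minority samples, until the classes are balanced.

```python
import numpy as np

# Sketch of the NearMiss-1 idea: rank majority samples by mean distance
# to their k nearest minority samples and keep the closest ones.
def near_miss_1(X, y, majority=0, k=3):
    Xmaj, Xmin = X[y == majority], X[y != majority]
    k = min(k, len(Xmin))
    # pairwise distances: each majority point vs every minority point
    d = np.linalg.norm(Xmaj[:, None, :] - Xmin[None, :, :], axis=2)
    score = np.sort(d, axis=1)[:, :k].mean(axis=1)
    keep = np.argsort(score)[: len(Xmin)]          # balance the classes
    X_res = np.vstack([Xmaj[keep], Xmin])
    y_res = np.concatenate([np.full(len(keep), majority), y[y != majority]])
    return X_res, y_res

X = np.array([[0, 0], [1, 0], [5, 5], [6, 6], [7, 7], [8, 8]])
y = np.array([1, 1, 0, 0, 0, 0])
X_res, y_res = near_miss_1(X, y)
print(y_res)
```

The majority points retained are the ones bordering the minority class, which is why the boundary-focused "pictures" in the imbalanced-learn docs look the way they do.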
I must have changed it in the wrong place.
Please check what's wrong.