Machine learning in trading: theory, models, practice and algo-trading - page 2111

 
Aleksey Vyazmikin:

No, it would be a mere fitting, not a model with meaning!

I disagree. By quantizing you reduce the amount of information. The maximum number of quantization bins will retain the maximum amount of information.

But quantizing into 65535 bins takes longer than into 255.
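As a rough illustration of that trade-off (a Python sketch with made-up data, not code from the thread): quantile-based quantization into 255 versus 65535 bins, counting how many distinct values survive and hinting at why the finer grid costs more work.

```python
import numpy as np

def quantize(x: np.ndarray, n_bins: int) -> np.ndarray:
    """Quantile-based quantization: map each value to a bin index 0..n_bins-1."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])  # interior bin borders
    return np.searchsorted(edges, x, side="right")

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

for n_bins in (255, 65535):
    q = quantize(x, n_bins)
    print(f"{n_bins:>6} bins -> {np.unique(q).size} distinct quantized values")
```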

 
elibrarius:

Do you know how to do it?

Yes, I'm working on it - it was originally done for genetic trees.

You need to evaluate how information is distributed over the sample and how it relates to the target. I look at how much the error decreases in a particular quantization segment and what percentage of samples it contains - balancing these two metrics lets me select the best partitions.
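A hedged sketch of how such an evaluation could look (my own naming and scoring formula, assuming binary labels; not Aleksey's actual code): for each quantization segment, take its share of the sample and the deviation of its target balance from the whole-sample balance, then rank segments by a combination of the two.

```python
import numpy as np

def score_segments(bin_idx: np.ndarray, y: np.ndarray):
    """Rank quantization segments by (sample share) x (target-balance deviation)."""
    base_rate = y.mean()                                 # target balance of the whole sample
    scores = []
    for b in np.unique(bin_idx):
        mask = bin_idx == b
        share = mask.mean()                              # percentage of samples in this segment
        deviation = abs(y[mask].mean() - base_rate)      # how far the segment's balance drifts
        scores.append((b, share, deviation, share * deviation))
    return sorted(scores, key=lambda t: t[3], reverse=True)

# e.g. best = score_segments(quantize(x, 255), y)[:10], using the quantizer sketched earlier
```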

 
elibrarius:

I disagree. By quantizing you reduce the amount of information. The maximum number of quantization bins will retain the maximum amount of information.

But quantizing into 65535 bins takes longer than into 255.

You're wrong to disagree - there is little information in the data, and it needs to be separated from the noise. We (I) want stable dependencies, not ones that recur once every 5 years, for which there are not enough statistics to estimate their relation to a particular target; using an insufficient number of examples just leads to curve fitting.

 
Aleksey Vyazmikin:

Yes, I'm working on that - it was originally done for genetic trees.

You need to evaluate how information is distributed over the sample and how it relates to the target. I look at how much the error decreases in a particular quantization segment and what percentage of samples it contains - balancing these two metrics lets me select the best partitions.

How do you estimate the error in quantization? You can only get it by running training, and that is on all columns at once, not on each column being quantized at the moment.

 
elibrarius:
Aleksey Vyazmikin:

How do you estimate the error in quantization? You can only get it by running training, and that is on all columns at once, not on each column being quantized at the moment.

I estimate the change in the balance of the target labels relative to the entire sample. This is especially relevant when there are more than two target classes.

 
Aleksey Vyazmikin:

I estimate the change in the balance of the target labels relative to the entire sample. This is especially relevant when there are more than two target classes.

In any case, the next split will pass through one of the quantization points, dividing the data into 2 parts.

With 255 quanta the chunks are large, and you can only move a quantum's boundary quite coarsely - by 5-10-20% of its size. With 65535 quanta the step is about 0.5% of that quantum, and the tree will pick the best boundary.

Though that's unlikely. Usually it just hits the middle or the quarters. With 65535 quanta the middle will be located more precisely; with 255 it is 256 times coarser.
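The granularity argument can be checked with simple arithmetic (a sketch, nothing thread-specific): one quantum out of 255 spans roughly 257 of the fine 65535 quanta, so the boundary can only be shifted in steps of about 0.4% of that quantum's width.

```python
coarse, fine = 255, 65535
quanta_per_coarse = fine / coarse            # ~257 fine quanta inside one coarse quantum
step_pct = 100 / quanta_per_coarse           # boundary step as % of a coarse quantum's width
print(f"{quanta_per_coarse:.0f} fine quanta per coarse quantum, step ~ {step_pct:.2f}%")
```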

 
elibrarius:

In any case, the next split will pass through one of the quantization points, dividing the data into 2 parts.

With 255 quanta the chunks are large, and you can only move a quantum's boundary quite coarsely - by 5-10-20% of its size. With 65535 quanta the step is about 0.5% of that quantum, and the tree will pick the best boundary.

Though that's unlikely. Usually it just hits the middle or the quarters. With 65535 quanta the middle will be located more precisely; with 255 it is 256 times coarser.

Exactly, there will be a split, but it will not be 50/50 - it will be uneven, depending on how it matches the upper split(s). Logic suggests that the odds differ depending on whether you look at a segment saturated with ones or at one where the classes are evenly balanced (relative to the balance of the entire sample). The goal is to get at least 1% of the samples into a leaf while having about 65% of its labels belong to the same class.
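A minimal sketch of that leaf criterion (thresholds as stated above, function name mine): accept a candidate leaf only if it holds at least 1% of the whole sample and at least ~65% of its labels belong to one class.

```python
import numpy as np

def leaf_is_acceptable(y_leaf: np.ndarray, n_total: int,
                       min_share: float = 0.01, min_purity: float = 0.65) -> bool:
    """Check the '>=1% of samples, >=65% one class' rule for a candidate leaf."""
    if y_leaf.size < min_share * n_total:                 # at least 1% of all samples
        return False
    _, counts = np.unique(y_leaf, return_counts=True)
    return counts.max() / y_leaf.size >= min_purity       # ~65% of labels from one class
```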

 
Aleksey Vyazmikin:

Exactly, there will be a split, but it will not be 50/50 - it will be uneven, depending on how it matches the upper split(s). Logic suggests that the odds differ depending on whether you look at a segment saturated with ones or at one where the classes are evenly balanced (relative to the balance of the entire sample). The goal is to get at least 1% of the samples into a leaf while having about 65% of its labels belong to the same class.

I think this is a very difficult task.

And if such a feature is found, you can work with it alone, even without ML.

Unfortunately we don't have such features.

 
Maxim Dmitrievsky:

I don't need it for the exam, but it may come in handy.

The results are strange - on the training and test samples Recall is 0.6-0.8, while on the exam sample it is 0.009 without the transformation and 0.65 with it - something is wrong here :(

It seems that CatBoost has learned the transformation algorithm :)

Is there a way to mark the old and new rows? Then the transformed rows could be removed from the transformed sample to see whether this is a problem of interpretation or simply poor training after all.
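A hedged sketch of that diagnostic (the is_transformed flag is hypothetical - the data has no such column yet): if the generated rows were marked, Recall could be computed separately for original and transformed rows of the exam sample to see where the drop comes from.

```python
import pandas as pd
from sklearn.metrics import recall_score

def recall_by_origin(df: pd.DataFrame, y_true: str, y_pred: str,
                     flag: str = "is_transformed") -> dict:
    """Recall computed separately for original and transformed rows."""
    groups = df[flag].map({False: "original", True: "transformed"})
    return {name: recall_score(part[y_true], part[y_pred])
            for name, part in df.groupby(groups)}
```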

 
elibrarius:

I think this is a very difficult task.

And if such a feature is found, you can work with it alone, even without ML.

Unfortunately, we don't have such features.

On the Y axis is the grid partitioning, and on the X axis is the deviation as a percentage of each class's total count in the entire sample. The filter is 5%. You can see that different classes dominate in different areas; sometimes there is a mirror change - then the improvement comes at the expense of a particular class (its histogram goes negative) - and sometimes not. All this should be used in training, but the standard training methods known to me do not really take it into account. It may be more effective to search with genetics (more precisely, by elimination) - that remains to be done.
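A sketch of the calculation behind that chart, under my own naming (not the author's exact code): for every grid segment, each class's count is expressed as a percentage of that class's total in the whole sample, compared against the share a uniform spread would give, and deviations below the 5% filter are zeroed.

```python
import numpy as np
import pandas as pd

def segment_class_deviation(bin_idx: np.ndarray, y: np.ndarray,
                            filter_pct: float = 5.0) -> pd.DataFrame:
    """Per-segment over/under-representation of each class, in % of the class total."""
    df = pd.DataFrame({"bin": bin_idx, "y": y})
    counts = df.groupby(["bin", "y"]).size().unstack(fill_value=0)    # samples per segment per class
    share_of_class = counts / counts.sum(axis=0) * 100                # % of each class's total per segment
    expected = df.groupby("bin").size() / len(df) * 100               # share under a uniform spread
    deviation = share_of_class.sub(expected, axis=0)                  # positive = class dominates here
    return deviation.where(deviation.abs() >= filter_pct, 0.0)        # apply the 5% filter
```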
