Machine learning in trading: theory, models, practice and algo-trading - page 2109

 
Maxim Dmitrievsky:

Select all the files and download them - they will be zipped.

The sample lengths will then differ, if only a part is taken.

Thank you, indeed - you can download it as an archive, which is nice!

But the differing sample lengths are a problem - I was thinking of singling out the most random columns, where small deviations are acceptable.

I don't think this method needs to be applied to the sample - otherwise, how would you use it in the real world?

I'm running it for training now - let's see what comes out.

 
Aleksey Vyazmikin:

Thank you, indeed - you can download it as an archive, which is nice!

But the differing sample lengths are a problem - I was thinking of singling out the most random columns, where small deviations are acceptable.

I don't think this method needs to be applied to the sample - otherwise, how would you use it in the real world?

I'm running it for training now - let's see what comes out.

I don't need it for exams, but it may come in handy.

 
elibrarius:

Too lazy to convert)
I'll explain the point:

1) sort the column
2) calculate the average number of elements per quantum, e.g. 10000 elements / 255 quanta = 39.21
3) in a loop, step through the sorted array by 39.21 elements at a time and add the value at that position to the array of quantum values. I.e. array element 0 = quantum 0, element 39 = quantum 1, element 78 = quantum 2, and so on.

If a value is already in the array, i.e. we have hit an area with many duplicates, we skip the duplicate and don't add it.

At each step we add exactly 39.21 and round the accumulated sum to pick the element from the array, so the spacing stays even. I.e. instead of element 195 (39 * 5 = 195) we take element 196 (39.21 * 5 = 196.05, truncated to 196).
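A minimal sketch of that counting scheme in C++ (names are made up, this is not CatBoost code):

    #include <algorithm>
    #include <vector>

    // Walk the sorted column in steps of size N / quantCount and take the element
    // at each (truncated) position as a border, skipping duplicate values.
    std::vector<double> EqualCountBorders(std::vector<double> values, int quantCount) {
        std::sort(values.begin(), values.end());
        std::vector<double> borders;
        const double step = static_cast<double>(values.size()) / quantCount; // 10000 / 255 = 39.21
        double position = 0.0;
        for (int q = 0; q < quantCount; ++q) {
            const size_t index = static_cast<size_t>(position);              // 0, 39, 78, ..., 196
            const double candidate = values[index];
            if (borders.empty() || candidate != borders.back()) {            // skip duplicates
                borders.push_back(candidate);
            }
            position += step;                                                // accumulate 39.21, not 39
        }
        return borders;
    }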

With a uniform distribution it's clear - I would first build an array of unique values and use it for the cutting.
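A rough sketch of that unique-values idea (illustrative helper, not from any library):

    #include <algorithm>
    #include <vector>

    // Keep only the distinct sorted values and put a border halfway between
    // each neighbouring pair of unique values.
    std::vector<double> UniqueValueBorders(std::vector<double> values) {
        std::sort(values.begin(), values.end());
        values.erase(std::unique(values.begin(), values.end()), values.end());
        std::vector<double> borders;
        for (size_t i = 0; i + 1 < values.size(); ++i) {
            borders.push_back((values[i] + values[i + 1]) / 2.0);
        }
        return borders;
    }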

But there are other methods of splitting the grid:

    // From CatBoost's border-selection code: each quantization mode maps to its own binarizer.
    THolder<IBinarizer> MakeBinarizer(const EBorderSelectionType type) {
        switch (type) {
            case EBorderSelectionType::UniformAndQuantiles:
                return MakeHolder<TMedianPlusUniformBinarizer>();
            case EBorderSelectionType::GreedyLogSum:
                return MakeHolder<TGreedyBinarizer<EPenaltyType::MaxSumLog>>();
            case EBorderSelectionType::GreedyMinEntropy:
                return MakeHolder<TGreedyBinarizer<EPenaltyType::MinEntropy>>();
            case EBorderSelectionType::MaxLogSum:
                return MakeHolder<TExactBinarizer<EPenaltyType::MaxSumLog>>();
            case EBorderSelectionType::MinEntropy:
                return MakeHolder<TExactBinarizer<EPenaltyType::MinEntropy>>();
            case EBorderSelectionType::Median:
                return MakeHolder<TMedianBinarizer>();
            case EBorderSelectionType::Uniform:
                return MakeHolder<TUniformBinarizer>();
        }
    }
 
Aleksey Vyazmikin:

With a uniform distribution it's clear - I would first build an array of unique values and use it for the cutting.

But there are other methods of splitting the grid:

There must be a lot of samples, otherwise the model won't learn anything.

 
Maxim Dmitrievsky:

There must be a lot of samples, otherwise the model won't learn anything.

These are the sample quantization methods for CatBoost - these are the boundaries along which the split enumeration/learning then proceeds.

My experiments show that the grid should be chosen for each predictor separately - then a quality gain is observed - but CatBoost can't do this, and I don't know how to build such a grid in it, so I have to build the grids myself, upload them to csv, and then go through them to assess how the target behaves in them. I think it's a very promising tool, but I need to translate the code into MQL.
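A rough sketch of that kind of check (hypothetical helper; assumes the border grid has already been read from the csv): for each row, find its quantum and accumulate the share of positive targets per quantum.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // For every row, locate its quantum via the sorted border grid and report
    // the number of rows and the target rate inside each quantum.
    void TargetRatePerQuantum(const std::vector<double>& feature,
                              const std::vector<int>& target,     // 0/1 labels
                              const std::vector<double>& borders) // sorted border values
    {
        std::vector<int> rows(borders.size() + 1, 0);
        std::vector<int> positives(borders.size() + 1, 0);
        for (size_t i = 0; i < feature.size(); ++i) {
            const size_t bin = std::upper_bound(borders.begin(), borders.end(), feature[i]) - borders.begin();
            ++rows[bin];
            positives[bin] += target[i];
        }
        for (size_t bin = 0; bin < rows.size(); ++bin) {
            if (rows[bin] > 0) {
                std::printf("quantum %zu: %d rows, target rate %.3f\n",
                            bin, rows[bin], double(positives[bin]) / rows[bin]);
            }
        }
    }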

 
Aleksey Vyazmikin:

These are the sample quantization methods for CatBoost - these are the boundaries along which the split enumeration/learning then proceeds.

My experiments show that the grid should be chosen for each predictor separately - then a quality gain is observed - but CatBoost can't do this, and I don't know how to build such a grid in it, so I have to build the grids myself, upload them to csv, and then go through them to assess how the target behaves in them. I think it's a very promising tool, but I need to translate the code into MQL.

Is it in the settings of the model itself (parameters)? I don't know what that is.

If it's not in the settings, then it's bullshit.

 
Maxim Dmitrievsky:

Is it in the settings of the model itself (parameters)? I don't know what that is.

If it's not in the settings, then it's bullshit.

It is in the settings, at least for the command line

--feature-border-type

The quantization mode for numerical features.

Possible values:
  • Median
  • Uniform
  • UniformAndQuantiles
  • MaxLogSum
  • MinEntropy
  • GreedyLogSum
(See "Quantization" in the CatBoost documentation at catboost.ai for how each mode chooses the splits.)
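For example, an illustrative command-line run (file names made up) could pass it together with the border count:

    catboost fit --learn-set train.tsv --column-description train.cd --feature-border-type MinEntropy --border-count 254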
 
Aleksey Vyazmikin:

It is in the settings, at least for the command line

--feature-border-type

The quantization mode for numerical features.

Possible values:
  • Median
  • Uniform
  • UniformAndQuantiles
  • MaxLogSum
  • MinEntropy
  • GreedyLogSum

Does it make a big difference? It should be within a percent.

 
Aleksey Vyazmikin:

With a uniform distribution it's clear - I would first build an array of unique values and use it for the cutting.

But there are other methods of splitting the grid:

With unique values it will be a mess. For example, roughly 100 rows in total, of which 10 values are unique: 2 of them cover 45 rows each and 8 cover one row each. Split into 5 quanta, it may well turn out that only 5 of the single-row values are chosen, while the 2 most representative ones (45 rows each) are skipped.
 
Maxim Dmitrievsky:

Does it make a big difference? It should be within a percent.

Choosing the right partitioning makes a big difference.

Here's an example on Recall - up to 50% variation - for me that's significant.

The number of borders goes from 16 to 512 in steps of 16 - though not in that order on the histogram - my labels get in the way a bit.
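A sweep of that kind could be scripted roughly like this (illustrative only, same made-up file names as above; the metric of each run is then compared):

    for n in $(seq 16 16 512); do
        catboost fit --learn-set train.tsv --column-description train.cd --border-count $n
    done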


I am still experimenting with grid selection, but it is already obvious that different predictors need different grids, so that the splits follow the logic and are not just a fit.