Machine learning in trading: theory, models, practice and algo-trading - page 2974

 
mytarmailS #:
It's sad that machine learning with a target doesn't work, and one-shot learning doesn't work either...

Remove symmetric features to reduce the bias.

For example, replace increments with absolute increments (volatility).

It sometimes helps.

 
Maxim Dmitrievsky #:

Remove symmetric features to reduce the bias.

For example, replace increments with absolute increments (volatility).

It sometimes helps.

No, it's much more complicated than that.
 
Aleksey Vyazmikin #:

My idea is to obtain a model that will select stable quantum segments based on a number of statistical features. Everyone is welcome to join this project.

Why are you so obsessed with these quanta?...

There's nothing intellectual about them. You simply divide, say, 10,000 rows into 100 pieces: sort the column, count off 100 rows from the bottom, and if the rows that follow coincide with the hundredth one (i.e. they are duplicates), assign them all to the first piece. Once the duplicates run out, you start filling the second quantum with the next 100 rows plus any duplicates, and so on until the rows run out.

Even 1 tree contains orders of magnitude more useful information (because it is trained on the data) than these quantum segments (just 100 sorted rows with their duplicates).
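
For illustration, a minimal C++ sketch of the counting procedure just described (my own reconstruction for this thread, not CatBoost code): take a sorted column, cut off roughly binSize rows at a time, and let each quantum swallow the duplicates of its last value.

#include <cstdio>
#include <vector>

// Input: a column sorted in ascending order. Output: border values; each
// quantum holds roughly binSize rows plus any duplicates of its last value.
std::vector<float> EqualFrequencyBorders(const std::vector<float>& sorted, size_t binSize) {
    std::vector<float> borders;
    size_t i = 0;
    while (i + binSize < sorted.size()) {
        i += binSize;                        // take the next ~binSize rows
        float last = sorted[i - 1];          // value that closes the current quantum
        while (i < sorted.size() && sorted[i] == last) {
            ++i;                             // pull duplicates into the same quantum
        }
        if (i < sorted.size()) {
            borders.push_back(last);         // border between this quantum and the next
        }
    }
    return borders;
}

int main() {
    std::vector<float> col = {1, 1, 2, 2, 2, 2, 3, 4, 5, 6, 7, 8};
    for (float b : EqualFrequencyBorders(col, 3)) {
        std::printf("border at %g\n", b);    // the duplicates of 2 all land in the first quantum
    }
}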

 
Forester #:

Why are you so obsessed with these quanta?...

There's nothing intellectual about them. You simply divide, say, 10,000 rows into 100 pieces: sort the column, count off 100 rows from the bottom, and if the rows that follow coincide with the hundredth one (i.e. they are duplicates), assign them all to the first piece. Once the duplicates run out, you start filling the second quantum with the next 100 rows plus any duplicates, and so on until the rows run out.

Even 1 tree contains orders of magnitude more useful information (because it is trained on the data) than these quantum segments (just 100 sorted rows with their duplicates).

Quantum segments are the bricks from which the CatBoost model is built. Initially, as I understand it, this was a way to save memory and, in general, to speed up the calculations. A side benefit is a reduction in the number of combinations of predictor values, a step towards reducing multicollinearity, which generally helps both the speed and the quality of training. It also partly addresses the problem of data drift.

I also see an impetus for something else: exploring the potential of probabilistic estimation on quantum-segment data. Take the quantisation method you described (actually, it is better to think of the purpose of the process as sifting out homogeneous groups, by analogy with clustering) and split the data into 20 quantum segments with equal numbers of examples: each segment then holds only 5% of the data. CatBoost by default creates 254 separators, i.e. 255 segments. Trees are then built from these segments.

The assumption is that all quantum segments are equally useful, and that their relative placement will be determined by partitioning groups into subgroups as the decision tree is built. The partitioning is done both by the root predictor and by the others. Even for a single tree, how many of the original examples of the positive class "1" will remain in the final leaf after 6 splits? Bear in mind that split selection is based on metrics accumulated over the whole set of quantum splits.

Given how the tree itself is constructed, it becomes obvious that the better the predictor is partitioned into quantum segments, the fewer splits are needed to reach the same accuracy in a leaf. Note that every split is a hypothesis, and not all hypotheses can be true. So if we quantise with an eye to each segment's potential to belong predominantly to one class, we reduce the number of splits needed to reach similar accuracy, and hence the number of potentially false hypotheses (splits). Moreover, if we can immediately partition the predictor into 3 global regions, two indicating class membership and one indicating uncertainty, the models will on average be smaller and have better statistics; I expect they will also be more robust.

For example, imagine the predictor is the RSI oscillator: significant participant activity occurs around the levels 70, 50 and 30, and everything beyond them, let's say, does not affect market participants' decisions. Then it is reasonable to build the quantum table in a way that separates these 3 regions from the rest of the population. Otherwise one of the quantum segments will, by chance, end up with more examples of one class, and you will get a false rule built on a false hypothesis.
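A hypothetical sketch of such a quantum table in C++ (the ±1-point widths around the levels are my own assumption, purely for illustration):

#include <cstdio>
#include <vector>

int main() {
    // Each pair of borders carves out a narrow quantum around one RSI level;
    // everything between the levels falls into wide "indifferent" segments.
    std::vector<float> borders = {29.f, 31.f, 49.f, 51.f, 69.f, 71.f};

    auto quantize = [&](float rsi) {
        int bin = 0;
        for (float b : borders) {
            if (rsi > b) ++bin;              // index of the quantum segment rsi falls into
        }
        return bin;
    };

    for (float rsi : {25.f, 30.f, 45.f, 50.f, 70.5f, 90.f}) {
        std::printf("rsi=%4.1f -> quantum %d\n", rsi, quantize(rsi));
    }
}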

You can draw a bar chart of the quantised predictor values and plot a curve of the probability of membership in class "1" for each column. If the curve is close to a straight line, I would set such a predictor aside. A good predictor, in my opinion, will show either a sloping line or spikes on some of the columns.

We can say that through quantisation I am looking for discrete events that influence the probability of price movement.
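
For illustration, a minimal C++ sketch of that diagnostic (my own toy example with invented data and borders, not Aleksey's code): quantise a predictor and measure the frequency of class "1" per segment; flat frequencies suggest a predictor to set aside, slopes or spikes something potentially useful.

#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // Toy rows of (predictor value, class label); assumed, for illustration.
    std::vector<std::pair<float, int>> rows = {
        {10, 0}, {20, 0}, {35, 1}, {40, 1}, {55, 0},
        {60, 1}, {72, 1}, {80, 0}, {90, 0}, {95, 1},
    };
    std::vector<float> borders = {33.f, 66.f};          // 3 quantum segments

    std::vector<int> total(borders.size() + 1, 0), ones(borders.size() + 1, 0);
    for (const auto& [x, y] : rows) {
        int bin = 0;
        for (float b : borders) if (x > b) ++bin;       // which segment x falls into
        ++total[bin];
        ones[bin] += y;                                 // count class-"1" hits per segment
    }
    for (size_t i = 0; i < total.size(); ++i) {
        std::printf("segment %zu: P(class=1) = %.2f (n=%d)\n",
                    i, total[i] ? double(ones[i]) / total[i] : 0.0, total[i]);
    }
}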

 
The rule itself will do the dividing:
rsi>69 & rsi<71...
That's all your quantisation is.
You take a tree-based ML model, break it into trees and pull out the right ones.

What quantisation? You're talking such nonsense, it's pathetic.

It's all solved in three lines of code...
And you've been tinkering with this quantisation for years, like a mad professor.
 
mytarmailS #:
The rule itself will do the dividing:
rsi>69 & rsi<71...
That's all your quantisation is.
You take a tree-based ML model, break it into trees and pull out the right ones.

What quantisation? You're talking such nonsense, it's pathetic.

It's all solved in three lines of code...
And you've been tinkering with this quantisation for years, like a mad professor.

There are different ways to create a quant table. Indeed, I think it could be done with an off-the-shelf package that builds trees on a single predictor with given constraints on the percentage of examples in a leaf. What that package is and how to get the data out of it in the format I need, I don't know.

What's important is not just the partitioning, but finding criteria for evaluating a quantum split that increase the probability that class membership is preserved on new data.

Why am I doing this? Because this is the key to building a quality model.

Why does it take so long? A lot of experiments and test scripts. I have only a little understanding of OpenCL, and the code now partially runs on a GPU; it takes time, as I have to learn a lot of things.
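
To make the idea concrete, a rough sketch of one possible criterion of that kind (my own illustration with invented toy data, not Aleksey's actual method): quantise on one period, then check whether each segment's class-"1" frequency holds up on a later period.

#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Frequency of class "1" per quantum segment for (value, label) rows.
std::vector<double> BinFreqs(const std::vector<std::pair<float, int>>& rows,
                             const std::vector<float>& borders) {
    std::vector<int> total(borders.size() + 1, 0), ones(borders.size() + 1, 0);
    for (const auto& [x, y] : rows) {
        int bin = 0;
        for (float b : borders) if (x > b) ++bin;
        ++total[bin];
        ones[bin] += y;
    }
    std::vector<double> freq(total.size(), 0.0);
    for (size_t i = 0; i < total.size(); ++i)
        if (total[i]) freq[i] = double(ones[i]) / total[i];
    return freq;
}

int main() {
    std::vector<float> borders = {33.f, 66.f};
    // Invented train/validation periods, for illustration only.
    std::vector<std::pair<float, int>> train = {{10,0},{20,0},{40,1},{50,1},{70,0},{90,0}};
    std::vector<std::pair<float, int>> valid = {{15,0},{25,1},{45,1},{55,1},{75,0},{85,0}};

    auto f1 = BinFreqs(train, borders), f2 = BinFreqs(valid, borders);
    for (size_t i = 0; i < f1.size(); ++i) {
        // Small drift => the segment's class bias looks stable across periods.
        std::printf("segment %zu: train %.2f, valid %.2f, drift %.2f\n",
                    i, f1[i], f2[i], std::fabs(f1[i] - f2[i]));
    }
}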

 
Aleksey Vyazmikin #:

What's important is not just the partitioning, but finding criteria for evaluating a quantum split that increase the probability that class membership is preserved on new data.

Have you even looked at the code in CatBoost itself? You're using it. It doesn't use third-party packages. And it is quite a small function (even simpler than what I described: it doesn't shift the separation point by the number of duplicates).
I've added comments on what is going on there. The input is a sorted column.

static THashSet<float> GenerateMedianBorders(
    const TVector<float>& featureValues, const TMaybe<TVector<float>>& initialBorders, int maxBordersCount) {
    THashSet<float> result;
    ui64 total = featureValues.size(); // number of rows in the column
    if (total == 0 || featureValues.front() == featureValues.back()) { // sanity checks
        return result;
    }

    for (int i = 0; i < maxBordersCount; ++i) { // loop over the number of quanta
        ui64 i1 = (i + 1) * total / (maxBordersCount + 1); // row index where quantum i starts
        i1 = Min(i1, total - 1); // no more than the number of rows
        float val1 = featureValues[i1]; // value at row i1
        if (val1 != featureValues[0]) { // if != the value in row 0, so there is no duplicate of the 0th row's border
            result.insert(RegularBorder(val1, featureValues, initialBorders)); // store the value in the array of borders separating the quanta (haven't checked, but obviously duplicates will be skipped and won't get a quantum of their own)
        }
    }
    return result;
}

As you can see, everything is very simple and there is nothing intellectual here: just count off, say, 100 rows at a time, and that's all.
Slightly more complex variants can shift the border by the number of duplicates, and the quantum sizes can also be optimised (for example, if 9,000 of 10,000 rows are duplicates, the simple function gives 11 quanta: 10 from the first 1,000 rows and an 11th with the remaining 9,000 duplicates; alternatively, the first 1,000 rows could be divided into 99 quanta plus 1 quantum with the 9,000 duplicates).
But there is nothing intellectual in those either: the same simple counting of the required number of rows underlies them all.
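
For anyone who wants to experiment with this logic outside of CatBoost, a rough standalone approximation of the function above using only the standard library (RegularBorder is reduced to inserting the raw value, which is a simplification):

#include <algorithm>
#include <cstdio>
#include <set>
#include <vector>

std::set<float> MedianBorders(const std::vector<float>& sortedValues, int maxBordersCount) {
    std::set<float> result;
    size_t total = sortedValues.size();
    if (total == 0 || sortedValues.front() == sortedValues.back()) {
        return result;                       // empty or constant column: no borders
    }
    for (int i = 0; i < maxBordersCount; ++i) {
        size_t i1 = (i + 1) * total / (maxBordersCount + 1);  // start row of quantum i
        i1 = std::min(i1, total - 1);
        float val = sortedValues[i1];
        if (val != sortedValues[0]) {
            result.insert(val);              // std::set silently drops duplicate borders
        }
    }
    return result;
}

int main() {
    // 10 distinct values followed by a long run of duplicates: the duplicates
    // collapse most candidate borders, so far fewer quanta come out than asked for.
    std::vector<float> col;
    for (int v = 1; v <= 10; ++v) col.push_back(float(v));
    col.insert(col.end(), 90, 10.f);         // 90 duplicates of the value 10

    for (float b : MedianBorders(col, 20)) {
        std::printf("border %g\n", b);
    }
}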

The original (there are more complicated variants) https://github.com/catboost/catboost/blob/3cde523d326e08b32caf1b8b138c2c5303dc52e5/library/cpp/grid_creator/binarization.cpp
Study the functions from this page for a week and save several years.

P.S. The main reason the next quantum ends up with not exactly 100 rows but 315, 88 or 4121 is not some super-clever formula (where rows are combined by predictive power, which is what you want to use as a criterion for evaluating a quantum segment), but simply the number of duplicates.
 
Should you rescue a person who is obviously drowning but enjoys his suicide?
When you offer him a helping hand, he rejects it, argues with you, doesn't try to do anything to get out on his own, and dictates his own terms, like:
"Either you rescue me yourself, carrying me out of the water over your head to my favourite music, or don't bother."
After that you have no desire to save him at all.


 
Forester #:

Have you even looked at the code in CatBoost itself? You're using it. It doesn't use third-party packages. And it is quite a small function (even simpler than what I described: it doesn't shift the separation point by the number of duplicates).
I've added comments on what is going on there. The input is a sorted column.

Of course I looked at it. Moreover, I am ready to pay for the work of reproducing all the quantisation methods in MQL5. So far the attempts have been unsuccessful; would you like to try?

Forester #:

As you can see, everything is very simple and there is nothing intellectual here: just count off, say, 100 rows at a time, and that's all.

You have shown the simplest method; yes, it is not difficult.

Besides, did I ever write that there are some ingenious quantisation methods or anything of the sort? How does this contradict what I wrote?

 
mytarmailS #:
Should you rescue a person who is obviously drowning but enjoys his suicide?
When you offer him a helping hand, he rejects it, argues with you, doesn't try to do anything to get out on his own, and dictates his own terms, like:
"Either you rescue me yourself, carrying me out of the water over your head to my favourite music, or don't bother."
After that you have no desire to save him at all.


Sit on the shore then, hero; that's your choice.

I've told you enough about what the point of this is and why solving this problem will improve the stability of the model.

What's the use of bragging that you know a package in R that can solve such and such a problem if I can't use it?