Discussing the article: "Cross-validation and basics of causal inference in CatBoost models, export to ONNX format"

 

Check out the new article: Cross-validation and basics of causal inference in CatBoost models, export to ONNX format.

The article proposes a method of creating trading bots using machine learning.

Just as our conclusions are often wrong and need to be verified, the predictions of machine learning models should be double-checked. If we turn the process of double-checking onto ourselves, we get self-control. Self-control of a machine learning model comes down to checking its predictions for errors many times in different but similar situations. If the model makes few errors on average, it means it is not overtrained, but if it makes mistakes often, then there is something wrong with it.

If we train the model once on selected data, then it cannot perform self-control. If we train a model many times on random subsamples, and then check the quality of the prediction on each and add up all the errors, we get a relatively reliable picture of the cases where it actually turns out to be wrong and the cases it often gets right. These cases can be divided into two groups and separated from each other. This is similar to conducting walk-forward validation or cross-validation, but with additional elements. This is the only way to achieve self-control and obtain a more robust model.

Therefore, it is necessary to conduct cross-validation on the training dataset, compare the model’s predictions with training labels and average the results across all folds. Those examples that were predicted incorrectly on average should be removed from the final training set as erroneous. We should also train a second model on all the data, which distinguishes well-predictable cases from poorly predictable ones, allowing us to cover all possible outcomes more fully.
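To make the procedure above concrete, here is a minimal sketch, not the article's actual code: it assumes a numeric NumPy feature matrix X and an array of binary labels y, and the function name cv_error_filter and all parameter values are purely illustrative.

import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold

def cv_error_filter(X, y, n_splits=5, n_repeats=3, threshold=0.5):
    # Average per-example error over repeated cross-validation folds
    errors = np.zeros(len(y))
    counts = np.zeros(len(y))
    for seed in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train_idx, val_idx in skf.split(X, y):
            model = CatBoostClassifier(iterations=300, verbose=False, random_seed=seed)
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[val_idx]).astype(int).ravel()
            errors[val_idx] += (pred != y[val_idx])
            counts[val_idx] += 1
    # True = example that is misclassified more often than the threshold
    return (errors / counts) > threshold

bad = cv_error_filter(X, y)

# First model: trained only on the examples that are predicted reliably
main_model = CatBoostClassifier(iterations=500, verbose=False)
main_model.fit(X[~bad], y[~bad])

# Second (meta) model: trained on all the data to separate well-predictable
# cases from poorly predictable ones
meta_model = CatBoostClassifier(iterations=500, verbose=False)
meta_model.fit(X, bad.astype(int))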

Author: Maxim Dmitrievsky

 
If the model makes few errors on average, it means it is not overtrained, but if it makes mistakes often, then there is something wrong with it.
However, this proposal comes with a catch! If the models are ranked by error rate and the best ones are picked, this is again overtraining.
 
Therefore, it is necessary to conduct cross-validation on the training dataset, compare the model's predictions with training labels and average the results across all folds. Those examples that were predicted incorrectly on average should be removed from the final training set as erroneous. We should also train a second model on all the data, which distinguishes well-predictable cases from poorly predictable ones, allowing us to cover all possible outcomes more fully.
The first model trades, the second model classifies (and predicts) weak trading locations. Right?
 
fxsaber #:
The first model trades, the second model classifies (and predicts) weak trading locations. Right?
Yes
 
fxsaber #:
However, this proposal comes with a catch! If the models are ranked by error rate and the best ones are picked, this is again overtraining.
Well, there should always be a choice :)
The main thing is that all the models more or less pass OOS.
This is one of dozens of algorithms, the easiest to understand, because judging by the feedback on past articles it seemed that readers simply do not understand what is going on. Otherwise, what is the point of writing?
 
An interesting discussion specifically about statistical methods in ML, if there is anyone who has something to say or add to it.
 

1) I would like to see the performance of the model on a third sample, which was neither the train nor the test set and was not involved in any way in the creation and selection of the model.

2) Noise detection and relabelling of labels, or meta-labelling, was described by Vladimir in his 2017 article, where he used the NoiseFiltersR package for this purpose.

Deep Neural Networks (Part III). Sample selection and dimensionality reduction
  • www.mql5.com
This article continues the series of publications on deep neural networks. It covers the selection of examples (removing noisy ones), dimensionality reduction of the input data, and splitting the dataset into train/val/test sets while preparing the data for training.
 
mytarmailS #:

1) I would like to see the performance of the model on a third sample, which was neither the train nor the test set and was not involved in any way in the creation and selection of the model.

2) Noise detection and relabelling of labels, or meta-labelling, was described by Vladimir in his 2017 article, where he used the NoiseFiltersR package for this purpose.

The bot is attached to the article

It describes a few of the tens or hundreds of similar methods; there is no desire to delve into each of them, especially without verifying the results. I'm more interested in my own designs and in testing them immediately, and now converting to ONNX allows this to be done even faster. The core approach is easy to extend or rewrite without changing the rest of the code, which is also very cool. This example of finding erroneous examples via CV has a flaw that doesn't allow us to talk about causal inference fully, so this is an introduction. I'll try to explain it some other time.
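As a rough illustration of that ONNX step (the file name and parameters here are made up, not taken from the attached bot), CatBoost can save a trained model straight to ONNX:

from catboost import CatBoostClassifier

model = CatBoostClassifier(iterations=500, verbose=False)
model.fit(X, y)  # X, y — numeric features and labels, assumed to already exist

# Export to ONNX so the model can be loaded from MQL5 (e.g. via OnnxCreate)
model.save_model("catboost_model.onnx", format="onnx")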

The article is useful if only because it is a ready-made solution for experimenting with ML. The functions are optimised and work fast.
 
Great, I love your articles and learnt from them. I have also written a piece, which is now being tested, on exporting a random forest model to ONNX) I will try your model too) I hope to publish it, I'm a beginner =).
 
Yevgeniy Koshtenko #:
random forest model to ONNX) I will try your model too) I hope to publish it, I'm a beginner =).

More ML is always welcome :) I'm an amateur too.

 
You can't reverse the order that way:
   int k = ArraySize(Periods) - 1;
   for(int i = 0; i < ArraySize(Periods); i++) {
      f[i] = features[i];
      k--;
   }
It should be
f[k] = features[i];
Why reverse the order at all?