Is there a pattern in the chaos? Let's try to find it! Machine learning on a concrete sample.

 

Actually, I suggest downloading the file from the link. The archive contains 3 CSV files:

  1. train.csv - the sample on which you need to train.
  2. test.csv - an auxiliary sample; it can be used during training, including merged with train.
  3. exam.csv - a sample that does not participate in training in any way.

The sample contains 5581 columns with predictors; the target is in column 5583, "Target_100". Columns 5581, 5582, 5584, and 5585 are auxiliary and contain:

  1. column 5581, "Time" - date of the signal
  2. column 5582, "Target_P" - direction of the trade: "+1" - buy / "-1" - sell
  3. column 5584, "Target_100_Buy" - financial result from buying
  4. column 5585, "Target_100_Sell" - financial result from selling.

The goal is to create a model that will "earn" more than 3000 points on the exam.csv sample.

The solution must not peek into exam, i.e. must not use data from that sample.

To keep things interesting, it is desirable to describe the method that achieved such a result.

The samples may be transformed in any way you like, including changing the target, but you should explain the nature of the transformation so that it is not a pure fit to the exam sample.
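For reference, a minimal pandas sketch of how the columns described above could be separated into predictors, target, and auxiliary data. The column names come from the post; the tiny synthetic frame merely stands in for train.csv (the real files have 5585 columns and, judging by the CatBoost commands later in the thread, a ';' delimiter, i.e. `pd.read_csv("train.csv", sep=";")`).

```python
import pandas as pd

AUX = ["Time", "Target_P", "Target_100_Buy", "Target_100_Sell"]
TARGET = "Target_100"

def split_sample(df: pd.DataFrame):
    """Return (X, y, aux): predictor columns, the target, and the auxiliary columns."""
    y = df[TARGET]
    aux = df[AUX]
    X = df.drop(columns=AUX + [TARGET])
    return X, y, aux

# Toy frame standing in for train.csv; values are illustrative only
df = pd.DataFrame({
    "p1": [0.1, 0.2], "p2": [1.0, 2.0],
    "Time": ["2020-01-01", "2020-01-02"],
    "Target_P": [1, -1],
    "Target_100": [1, 0],
    "Target_100_Buy": [0.005, -0.0007],
    "Target_100_Sell": [-0.0007, -0.0007],
})
X, y, aux = split_sample(df)
print(X.columns.tolist())  # ['p1', 'p2']
```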

 

There is, of course.
 
spiderman8811 #:
Of course there is.

Do you want to try and prove it?

 

Training CatBoost, as they say, out of the box with the settings below - brute-forcing the random seed - gives this probability distribution.

FOR %%a IN (*.) DO (
  FOR %%s IN (8 16 24 32 40 48 56 64 72 80) DO (
    catboost-1.0.6.exe fit --learn-set train.csv --test-set test.csv --column-description %%a --has-header --delimiter ; --model-format CatboostBinary,CPP --train-dir ..\Rezultat\RS_%%s\result_4_%%a --depth 6 --iterations 1000 --nan-mode Forbidden --learning-rate 0.03 --rsm 1 --fold-permutation-block 1 --boosting-type Plain --l2-leaf-reg 6 --loss-function Logloss --use-best-model --eval-metric Logloss --custom-metric Logloss --od-type Iter --od-wait 100 --random-seed %%s --random-strength 1 --auto-class-weights SqrtBalanced --sampling-frequency PerTreeLevel --border-count 32 --feature-border-type Median --bootstrap-type Bayesian --bagging-temperature 1 --leaf-estimation-method Newton --leaf-estimation-iterations 10
  )
)

The resulting probability distributions (plots):

  1. train sample
  2. test sample
  3. exam sample

As you can see, the model prefers to classify almost everything as zero - that way there is less chance of making a mistake.
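This pull toward the majority class is easy to reproduce: under Logloss, the best constant prediction is simply the positive-class rate, so on an imbalanced sample an unconstrained model gravitates toward low probabilities. A small self-contained illustration (the 70/30 split mirrors the class balance mentioned later in the thread):

```python
import math

def logloss(p, labels):
    """Mean log loss of predicting the constant probability p for every row."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in labels) / len(labels)

# 70% zeros, 30% ones
labels = [0] * 70 + [1] * 30

# Search the grid 0.01..0.99 for the constant prediction with minimal loss
best_p = min((p / 100 for p in range(1, 100)), key=lambda p: logloss(p, labels))
print(best_p)  # 0.3: the optimal constant equals the positive-class rate
```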

 

About the last 4 columns:

With class 0, apparently the result is a loss in both cases? I.e. -0.0007 in both cases. Or, if the buy/sell trade is still made, do we make a profit in the correct direction?

 
The 1/-1 direction is selected by a different logic, i.e. the ML model is not involved in choosing the direction? Do we just need to learn 0/1, trade/don't trade (with the direction rigidly chosen)?
 
elibrarius #:

About the last 4 columns:

With class 0, apparently the result is a loss in both cases? I.e. -0.0007 in both cases. Or, if the buy/sell trade is still made, do we make a profit in the correct direction?

With class zero - do not enter the trade.

Earlier I used 3 targets - that's why there are two financial-result columns at the end instead of one - but with CatBoost I had to switch to two targets.

elibrarius #:
The 1/-1 direction is selected by a different logic, i.e. the ML model is not involved in choosing the direction? Do we just need to learn 0/1, trade/don't trade (with the direction rigidly chosen)?

Yes, the model only decides whether to enter or not. However, within this experiment it is not forbidden to train a model with three targets; for that, it is enough to transform the target taking the entry direction into account.
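A sketch of that transformation (hypothetical helper; the convention that Target_100 == 1 means "enter" and Target_P gives the direction follows the column descriptions above):

```python
def three_class_target(target_100: int, target_p: int) -> int:
    """Map the binary enter/skip target plus the direction column into {-1, 0, 1}:
    -1 = sell entry, 0 = no trade, 1 = buy entry."""
    return target_p if target_100 == 1 else 0

print(three_class_target(1, 1), three_class_target(1, -1), three_class_target(0, 1))
# 1 -1 0
```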

 
Aleksey Vyazmikin #:

If the class is zero - do not enter the trade.

Earlier I used 3 targets - that's why there are two financial-result columns at the end instead of one - but with CatBoost I had to switch to two targets.

Yes, the model only decides whether to enter or not. However, within this experiment it is not forbidden to train a model with three targets; for that, it is enough to transform the target taking the entry direction into account.

I.e. if, with class 0 (do not enter), the correct trade direction is chosen, will a profit be made or not?
 
Aleksey Vyazmikin #:

If the class is zero - do not enter the trade.

Earlier I used 3 targets - that's why there are two financial-result columns at the end instead of one - but with CatBoost I had to switch to two targets.

Yes, the model only decides whether to enter or not. However, within this experiment it is not forbidden to train a model with three targets; for that, it is enough to transform the target taking the entry direction into account.

CatBoost has multiclass; it's strange to abandon 3 classes.

 
elibrarius #:
I.e. if, with class 0 (do not enter), the correct trade direction is chosen, will a profit be made or not?

There will be no profit (if you re-evaluate, there would be a small percentage of profit on the zero class).

The target can be redone correctly only by splitting "1" into "-1" and "1"; otherwise it is a different strategy.

elibrarius #:

CatBoost has multiclass; it's strange that you abandoned 3 classes.

There is, but there is no integration with MQL5.

Multiclass models cannot be exported to any language at all.

It can probably be hooked up as a DLL library, but I can't figure that out on my own.

 
Aleksey Vyazmikin #:

There will be no profit (if you re-evaluate, there would be a small percentage of profit on the zero class).

Then there is little point in the financial-result columns. There will also be class-0 prediction errors (forecasting 1 instead of 0), and the cost of such an error is unknown. That means the balance line cannot be built. Especially since you have 70% of class 0, i.e. 70% of the errors have an unknown financial result.
You can forget about 3000 points. Even if it works out, it will be unreliable.
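For what it's worth, the per-trade accounting being debated can be written down directly (a sketch; the sign conventions for the two financial-result columns are assumptions based on the column descriptions above). Rows the model skips contribute exactly zero, which is why the cost of a wrong "skip" never shows up in the balance:

```python
def earned_points(preds, target_p, fin_buy, fin_sell):
    """Sum the financial result over entered trades only.
    preds: the model's 0/1 enter decisions; the direction comes from target_p."""
    total = 0.0
    for enter, direction, buy, sell in zip(preds, target_p, fin_buy, fin_sell):
        if enter == 1:
            total += buy if direction == 1 else sell
    return total

# Toy example: two entries taken (one buy, one sell), one row skipped
print(earned_points([1, 0, 1], [1, 1, -1],
                    [50.0, -30.0, -10.0],
                    [-50.0, 30.0, 10.0]))
# 60.0
```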

I.e. there is no point in solving the problem...