Machine learning in trading: theory, models, practice and algo-trading - page 3283

 
Forester #:
I wonder what would happen if I computed the matrix myself and then computed the same matrix with the fast ALGLIB algorithm PearsonCorrM. Which would be faster?
PearsonCorrM is 40-50 times faster than ALGLIB's row-by-row algorithm; even a fast homemade algorithm is unlikely to overcome such a speed gap.

Here the homemade version is about twice as slow as PearsonCorrM.
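For reference, a minimal MQL5 sketch of the matrix call being discussed; it assumes the standard ALGLIB port bundled with the terminal (matrix accessors may differ slightly between builds) and uses random placeholder data:

#include <Math\Alglib\alglib.mqh>
// Correlate m series of n observations in a single call.
void CorrMatrixExample()
  {
   int n=1000,m=50;                       // n observations (rows) of m series (columns)
   CMatrixDouble data(n,m);
   for(int i=0;i<n;i++)
      for(int j=0;j<m;j++)
         data[i].Set(j,MathRand()/32767.0); // placeholder values
   CMatrixDouble corr;
   CAlglib::PearsonCorrM(data,n,m,corr);  // whole m x m correlation matrix at once
   Print("corr[0][1]=",corr[0][1]);
  }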

 

I compared the speed of CatBoost training in Python and via the command line:

- about 20% faster from startup to saving the model, including reading the samples

- about 12% faster for the training process itself

Tested on the same model - the training result is identical.

Of course, the command line is faster.

 
Aleksey Vyazmikin #:

I compared the speed of CatBoost training in Python and via the command line:

- about 20% faster from startup to saving the model, including reading the samples

- about 12% faster for the training process itself

Tested on the same model - the training result is identical.

Of course, the command line is faster.

Do you still use the command line to run EXEs?
You can run them via WinExec and even optimise them in the tester.

#import "kernel32.dll"
   int WinExec(uchar &Path[],int Flag);
   int SleepEx(int msec, bool Alertable=false);//SleepEx(1000,false); - для простого таймера
#import 
...
string CommonPath = TerminalInfoString(TERMINAL_COMMONDATA_PATH)+ "\\Files\\";
string RAM_dir="RAM\\";//использую диск в памяти для скорости
string agent_dir=RAM_dir+"0000";//при запуске в терминале
string p=MQLInfoString(MQL_PROGRAM_PATH);// C:\Users\User\AppData\Roaming\MetaQuotes\Tester\ххххххххххххххххххххххх\Agent-127.0.0.1-3000\MQL5\Experts\testEXE.ex5
int agent_pos=StringFind(p,"Agent-");// при оптимизации запустится в папке с номером агента
if(agent_pos!=-1){agent_pos=StringFind(p,"-",agent_pos+6);agent_dir = RAM_dir+StringSubstr(p,agent_pos+1,4);}//выдаст 3001, 3002... по номеру папки в котором тестер запустился
FolderCreate(agent_dir,FILE_COMMON);
...
sinput string PathExe="С:\\your.exe";//path to exe file
uchar ucha[];
StringToCharArray(PathExe_+" --dir "+CommonPath+agent_dir+"\\",ucha);//string to train
int visible=0;
FileDelete(agent_dir+"\\model.bin",FILE_COMMON); //сначала удалить старый
int x=WinExec(ucha,visible); //visible=0 - work in hidden window, 1 - work in opened exe window - can be closed by user. Better to use 0, to run proces up to finish.
while(true){if(FileIsExist(agent_dir+"\\model.bin",FILE_COMMON)){break;}SleepEx(1000);}// используем SleepEx из DLL (Sleep() от MQL не работает в оптимизаторе, будет грузить проц на 100% проверками файла. Через DLL 1%.). Файл с моделью появился - расчет закончен.
//модель обучена, читаем файл модели и используем

I haven't tried CatBoost, but I think something similar can be worked out.
The hardest part is detecting the moment training has finished. I do it by checking once per second for the appearance of the model file agent_dir+"\model.bin". Where CatBoost puts its model file I don't know - it may have to be looked for elsewhere.

Another possible problem: if the model file is huge, it may take a long time to write, i.e. the file already exists but has not been fully written yet. A pause may be needed, or a check that the writing process has not locked the file against reading...
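A minimal sketch of such a check, assuming the trainer keeps appending to the model file until it is done: treat the file as ready only when its size stops changing between one-second polls (the function name and interval are illustrative):

#import "kernel32.dll"
   int SleepEx(int msec,bool Alertable=false);
#import
// Wait until the file exists and its size has been stable for one full poll.
void WaitForModel(const string name) // e.g. agent_dir+"\\model.bin"
  {
   long prev_size=-1;
   for(;;)
     {
      SleepEx(1000,false); // cheap 1-second poll, as in the snippet above
      if(!FileIsExist(name,FILE_COMMON)) { prev_size=-1; continue; }
      int h=FileOpen(name,FILE_READ|FILE_BIN|FILE_COMMON|FILE_SHARE_READ|FILE_SHARE_WRITE);
      if(h==INVALID_HANDLE) continue; // the writer may still hold the file exclusively
      long size=(long)FileSize(h);
      FileClose(h);
      if(size>0 && size==prev_size) break; // no growth since the last poll - assume fully written
      prev_size=size;
     }
  }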
 
Forester #:

Do you still use the command line to run EXEs?
You can run them via WinExec and even optimise them in the tester.

I run bat files - so far that is the most convenient option for my current tasks.

So far there is only one task that needs training launched automatically after a sample is obtained - exploratory search. I plan to make an automatic bat file for it later.
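As a rough illustration of such a bat file (the paths and CatBoost options here are assumptions to be checked against the CatBoost CLI documentation), it can also drop a flag file at the end, which doubles as a completion signal:

@echo off
rem Hypothetical training run via the CatBoost command-line binary.
catboost.exe fit --learn-set train.tsv --column-description train.cd --loss-function Logloss --iterations 1000 --model-file model.bin
rem Create a separate flag file so a watcher knows training has finished.
echo done > finished.flag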

Forester #:

I haven't tried CatBoost, but I think something similar can be worked out.

The hardest part is detecting the moment training has finished. I do it by checking once per second for the appearance of the model file agent_dir+"\model.bin". Where CatBoost puts its model file I don't know - it may have to be looked for elsewhere.

Another possible problem: if the model file is huge, it may take a long time to write, i.e. the file already exists but has not been fully written yet. A pause may be needed, or a check that the writing process has not locked the file against reading...

Your solution is interesting - I'll take a look; maybe I'll adopt it.

But I think it should be possible to get a completion response from the executable itself - that is the approach I wanted to use.

As for the file option - if you run a batch file, then again, you can simply create a new flag file at the end of the task.

 
Aleksey Vyazmikin #:

Maybe the reason for the relative success of variants -1 and 0 lies in the size of the train sample, and it should be reduced? In any case, Recall seems to react to this.

In your opinion, should the results of such combinations be comparable to each other in our case? Or is the data irretrievably outdated?

I split the train sample into 3 parts, trained 3 models for variant -1 and variant 0, and also trained the original train sample as three separate samples.

This is what I got.

I made this generalised metric: PR = (Precision - 0.5) * Recall
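In code form the metric is a one-liner (the helper name is mine); precision is shifted by 0.5 so that a coin-flip classifier scores about zero and worse-than-random goes negative:

double PR(const double precision,const double recall)
  {
   return (precision-0.5)*recall; // e.g. Precision=0.55, Recall=0.40 -> PR=0.02
  }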

It seems that the training is driven mainly by the 2nd and 3rd parts of the train sample - I have now combined them and started training - let's see what happens.

Still, this looks like it might be a decent method for estimating how random the training is. Ideally the training should be relatively successful on each segment, otherwise there is no guarantee that the model won't simply stop working tomorrow.

 
Aleksey Vyazmikin #:

It seems that the training is driven mainly by the 2nd and 3rd parts of the train sample - I have now combined them and started training - let's see what happens.

Still, this looks like it might be a decent method for estimating how random the training is. Ideally the training should be relatively successful on each segment, otherwise there is no guarantee that the model won't simply stop working tomorrow.

And here are the results - the last two columns

Indeed, the results have improved. We can assume that the larger the sample, the better the training result.

We should try training on the 1st and 2nd parts of the training sample - if the results are not much worse than on the 2nd and 3rd parts, then sample freshness can be considered a less significant factor than sample volume.

 

I have written many times about the "predictive power of predictors", which I calculate as the distance between two vectors.

I came across a list of tools for calculating the distance:

library(proxy)
pr_DB$get_entry_names()
##  [1] "Jaccard"         "Kulczynski1"    
##  [3] "Kulczynski2"     "Mountford"      
##  [5] "Fager"           "Russel"         
##  [7] "simple matching" "Hamman"         
##  [9] "Faith"           "Tanimoto"       
## [11] "Dice"            "Phi"            
## [13] "Stiles"          "Michael"        
## [15] "Mozley"          "Yule"           
## [17] "Yule2"           "Ochiai"         
## [19] "Simpson"         "Braun-Blanquet" 
## [21] "cosine"          "angular"        
## [23] "eJaccard"        "eDice"          
## [25] "correlation"     "Chi-squared"    
## [27] "Phi-squared"     "Tschuprow"      
## [29] "Cramer"          "Pearson"        
## [31] "Gower"           "Euclidean"      
## [33] "Mahalanobis"     "Bhjattacharyya" 
## [35] "Manhattan"       "supremum"       
## [37] "Minkowski"       "Canberra"       
## [39] "Wave"            "divergence"     
## [41] "Kullback"        "Bray"           
## [43] "Soergel"         "Levenshtein"    
## [45] "Podani"          "Chord"          
## [47] "Geodesic"        "Whittaker"      
## [49] "Hellinger"       "fJaccard"

This is in addition to the standard stats::dist(), which has its own set of distances.
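To make the idea concrete outside R, a hand-rolled MQL5 sketch of two entries from that list, Euclidean and cosine distance between two equal-length vectors:

// Euclidean distance: square root of the summed squared differences.
double Euclidean(const double &a[],const double &b[])
  {
   double s=0;
   int n=MathMin(ArraySize(a),ArraySize(b));
   for(int i=0;i<n;i++) s+=(a[i]-b[i])*(a[i]-b[i]);
   return MathSqrt(s);
  }
// Cosine distance: 1 minus the cosine of the angle between the vectors.
double CosineDist(const double &a[],const double &b[])
  {
   double dot=0,na=0,nb=0;
   int n=MathMin(ArraySize(a),ArraySize(b));
   for(int i=0;i<n;i++) { dot+=a[i]*b[i]; na+=a[i]*a[i]; nb+=b[i]*b[i]; }
   return 1.0-dot/MathSqrt(na*nb); // 0 means identical direction
  }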
 
Aleksey Vyazmikin #:

And here are the results - the last two columns

Indeed, the results have improved. We can assume that the larger the sample, the better the training result.

We should try training on the 1st and 2nd parts of the training sample - if the results are not much worse than on the 2nd and 3rd parts, then sample freshness can be considered a less significant factor than sample volume.

In short, once again we run into ignorance of mathematical statistics.
I have an idea to Monte-Carlo the matstat by hitting it from different sides.
 
СанСаныч Фоменко #:

I have written many times about the "predictive power of predictors", which I calculate as the distance between two vectors.

I came across a list of tools for calculating the distance:

This is in addition to the standard stats::dist(), which has its own set of distances.

Can you show me an example of how to use it?

 
Maxim Dmitrievsky #:
In short, once again we run into ignorance of mathematical statistics.
I have an idea to Monte-Carlo the matstat by hitting it from different sides.

This is your hour - shine with your knowledge, expose the ignoramus!

And seriously, you'd better think about the results - I publish them for free, although they don't come cheap to me.
