Machine learning in trading: theory, models, practice and algo-trading - page 1783

 
Maxim Dmitrievsky:

What are the current states? If it's about clusters, you just need to check the statistics on the new data. If they are the same, you can build a trading system (TS)

Parameters on the bar: increments, velocities, averages over history. The averages lag by half the averaging range or a little less, and the increments are not significant enough. And nobody computes the parameters of the series as a whole. Two gradations, flat and trend, is not even funny.
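A minimal R sketch of that lag claim, on a synthetic linear trend (illustrative only): an n-period simple moving average trails the price by roughly (n-1)/2 bars.

price <- 1:100                                         # synthetic linear trend as "price"
n     <- 21
sma   <- stats::filter(price, rep(1/n, n), sides = 1)  # trailing n-period SMA
sma[50]                                                # 40: the price from 10 bars ago
price[50 - (n - 1)/2]                                  # also 40, i.e. the SMA lags (n-1)/2 = 10 bars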
 
Maxim Dmitrievsky:

If it's about clusters, you just need to check the statistics on the new data. If they are the same, you can build a TS.

One should clearly understand which clusters and which statistics are meant. If they are the same on all instruments from 1970 to 2020, then it is possible )))

 
mytarmailS:

The problem is the size of the data: I will not even be able to create features, so it will not even get to the training...

Make a sample of 50k. Let it be small, let it be not serious, let it be more prone to overfitting, and so on. The aim is not a production robot, but simply to reduce the error through joint creative work; the knowledge gained can then be transferred to any instrument and market. 50,000 observations is quite enough to see which features mean something.

All right, I'll make a small sample.

mytarmailS:

If you don't know the OHLC, you don't need to write it; and why shift the whole OHLC? Nobody does that: you just shift ZZ by one step, as if looking one step into the future for training, and that's all. Have you read at least one of Vladimir Perervenko's articles on deep learning? Please do. It is very inconvenient when there is already a well-established, optimal way of handling the data that everyone is used to, and someone tries to do the same thing in their own way, differently; it is kind of pointless and annoying, and it causes many errors for people who try to work with such an author's data.
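In R, the one-step shift described here might look like this (a sketch; zz is assumed to be a vector of ZigZag values, one per bar):

# label bar i with the direction of ZZ from bar i to bar i+1
target <- c(sign(diff(zz)), NA)   # NA on the last bar, whose future is unknown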

I read his articles, but I don't understand R code, so I can't really understand everything there.

So I'll ask you, since you understand this question. Classification takes place on the zero bar, when only the opening price is known. As I understand it, you do not use the Open of the zero bar, only information from bar 1 and older? So in effect the target on the zero bar is the ZZ direction? I take it that what was predicted is the direction of the next bar; that is not essential, is it? Otherwise I will have to redo a lot again, which is tiresome.

It's just that I already have a ready-made solution for taking the data and applying the model, not for recalculating it.

mytarmailS:

If after all this you still want to do something, I have the following requirements:

1) Data: 50-60k rows, no more; preferably one file. We just agree that the last n candles will be the test.

2) The data preferably without gluing (contract splicing), since then one can consider not only the most recent prices but also support and resistance levels, which is impossible with glued data.

3) The target should already be included in the data.

4) Data in the format date,time,o,h,l,c,target (see the reading sketch below).
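A minimal R sketch for reading a file in this format (the file name OHLC_Train.csv is the one used later in the thread):

d <- read.csv("OHLC_Train.csv", stringsAsFactors = FALSE)
str(d)   # expected columns: date, time, o, h, l, c, target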


Or should I make the dataset?

You can make demands of those who have made a commitment, i.e. not of me :) Let's agree.

1. Let's have 50k for training and another 50k for the test (a sample outside training).

2. ok.

3. okay.

4. okay.

Added: I understood that the Si-3.20 futures does not have enough regular bars (22,793) and you don't want gluing.


Added a sample on Sber; I got an accuracy of 67%.

Files:
Setup.zip  891 kb
 
Aleksey Vyazmikin:

So I'll ask you, since you have figured this out. Classification takes place on the zero bar, when only the opening price is known. As I understand it, you do not use the Open of the zero bar, only information from bar 1 and older? So in effect the target on the zero bar is the ZZ direction? I take it that what was predicted is the direction of the next bar; that is not essential, is it? Otherwise I will have to redo a lot again, which is tiresome.

Classification is done on the last bar whose Close is known (i.e. a fully formed OHLC candle), and we predict the ZZ direction of the future candle. Why take into account the candle on which only the Open is known, I cannot understand; what is the advantage, apart from the extra complexity both in understanding and in implementation? And if you consider that Open[i] is almost always equal to Close[i-1], then this approach only raises more questions for me.
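A quick R check of the Open[i] = Close[i-1] claim (a sketch; d is assumed to hold the o and c columns of the agreed format):

gap <- d$o[-1] - d$c[-nrow(d)]   # each bar's Open minus the previous bar's Close
mean(gap == 0)                   # share of bars opening exactly at the prior close
summary(gap)                     # size of the gaps where they do occur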


Aleksey Vyazmikin:

You can make demands of those who have made a commitment, i.e. not of me :) Let's agree.

I am not demanding anything from you personally, come on )) The requirement concerns the sample: it must be the same for everyone so that results can be compared, right? I think that's obvious.


And thank you for listening to me )

1) Data: 50-60k rows, no more; preferably one file.........

Let's have 50k for training and another 50k for the test (a sample outside training).

I named the numbers 50-60k pretty much off the top of my head; why not double them? )))

)))

1) Data: 50-60k, no more; better one file, we just agree

And thank you for uploading one file instead of two! ))
 

I tried it first, out of the box, so to speak...

Only the last n values are involved in the prediction, just like with you, since the error is the same.

There are 217 features; I know some of them are redundant, but I'm too lazy to clean them out.

I trained and validated on the file OHLC_Train.csv, 54,147 observations in total.


I trained the model on the first 10k observations (8k to be exact: the first 2k were not used, because the indicators were being calculated on them).

I tested the model on the remaining 44k of data, so I think there is no overfitting: the test is 5.5 times the train, 44/8 = 5.5.
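A minimal R sketch of this split (illustrative names; d is the data from OHLC_Train.csv):

warmup <- 1:2000                 # discarded: the indicators are still being calculated here
tr     <- 2001:10000             # ~8k observations for training
ts     <- 10001:nrow(d)          # the remaining ~44k for the test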


Of the models I tried boosting and random forest; boosting did not impress me, so I settled on the forest.

There is a strong class imbalance in the training set, but I'm too lazy to fiddle with it.

table(d$Target[tr])

   0    1 
3335 4666 

The final model on the current features is a random forest with 200 trees.
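A sketch of how such a model could be fit with the randomForest package (an assumption based on the printout below; feats is an illustrative placeholder for the feature-column names):

library(randomForest)
rf <- randomForest(x = d[tr, feats], y = as.factor(d$Target[tr]),
                   ntree = 200)  # default mtry = floor(sqrt(p)) gives 14 for ~217 features
print(rf)                        # OOB error estimate and confusion matrix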

On the training set...

 Type of random forest: classification
                     Number of trees: 200
No. of variables tried at each split: 14

        OOB estimate of  error rate: 14.75%
Confusion matrix:
     0    1 class.error
0 2557  778  0.23328336
1  402 4264  0.08615517

On the test set:

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 12449  5303
         1  9260 17135
                                          
               Accuracy : 0.6701          
                 95% CI : (0.6657, 0.6745)
    No Information Rate : 0.5083          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3381          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5734          
            Specificity : 0.7637          
         Pos Pred Value : 0.7013          
         Neg Pred Value : 0.6492          
             Prevalence : 0.4917          
         Detection Rate : 0.2820          
   Detection Prevalence : 0.4021          
      Balanced Accuracy : 0.6686          
                                          
       'Positive' Class : 0  
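A printout in this format is presumably produced by caret's confusionMatrix (an assumption; a call like this would reproduce it, reusing rf, feats and ts from the sketches above):

library(caret)
pred <- predict(rf, d[ts, feats])               # class predictions on the test part
confusionMatrix(pred, as.factor(d$Target[ts]))  # accuracy, kappa, sensitivity, etc.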

As you can see, the results are identical to yours, and millions of rows are not needed: 50k is quite enough to find a pattern, if there is one at all.

So we got the same results. This is our starting point; now this error has to be improved.

 

)) Funny thing ))

I deleted all the so-called technical-analysis indicators.

That leaves 86 features instead of the 217 in the example above.

And the quality of the model only improved )


Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 12769  5597
         1  8940 16841
                                          
               Accuracy : 0.6707          
                 95% CI : (0.6663, 0.6751)
    No Information Rate : 0.5083          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3396          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.5882          
            Specificity : 0.7506          
         Pos Pred Value : 0.6953          
         Neg Pred Value : 0.6532          
             Prevalence : 0.4917          
         Detection Rate : 0.2892          
   Detection Prevalence : 0.4160          
      Balanced Accuracy : 0.6694          
                                          
       'Positive' Class : 0 
 
mytarmailS:

Classification is done on the last bar whose Close is known (i.e. a fully formed OHLC candle), and we predict the ZZ direction of the future candle. Why take into account the candle on which only the Open is known, I cannot understand; what is the advantage, apart from the extra complexity both in understanding and in implementation? And if you consider that Open[i] is almost always equal to Close[i-1], then this approach only raises more questions for me.

You can't understand it because you work with data in R: in the terminal you cannot know when the OHLC of the current bar will be complete, which is why on the zero bar you can only get the OHLC of bar 1 and older. And the Open of the zero bar is the newest price information, which is especially relevant on large timeframes, since my sample contains the same class of predictors applied to different TFs.


mytarmailS:


1) Data: 50-60k rows, no more; preferably one file.........

Let's have 50k for training and another 50k for the test (a sample outside training).

I named the numbers 50-60k pretty much off the top of my head; why not double them? )))

)))

1) Data: 50-60k, no more; better one file, we just agree

And thank you for uploading one file instead of two! ))
mytarmailS:

I trained and validated on the file OHLC_Train.csv, 54,147 observations in total.

I trained the model on the first 10k observations (8k to be exact: the first 2k were not used, because the indicators were being calculated on them).

I tested the model on the remaining 44k of data, so I think there is no overfitting: the test is 5.5 times the train, 44/8 = 5.5.

As you can see, the results are identical to yours, and I don't need millions of rows: 50k is enough to find the patterns, if there are any.

So we got the same results. This is our starting point; now this error must be improved.

I split the sample into two files: the first is for any tricky training attempts, and the second is for checking the training results.

Do you have a way to save the model and test it on new data? If so, please check it; I gave my result for the OHLC_Exam.csv sample.

Can you send back these two files, also split, but with your predictors added and the column with the classification result?


Regarding overfitting or the lack of it.

In my opinion, this is clear overfitting.

 
Aleksey Vyazmikin:

Yes ... Everything is sadder on the new data (((

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 9215 5517
         1 3654 7787
                                          
               Accuracy : 0.6496          
                 95% CI : (0.6438, 0.6554)
    No Information Rate : 0.5083          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3007          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7161          
            Specificity : 0.5853          
         Pos Pred Value : 0.6255          
         Neg Pred Value : 0.6806          
             Prevalence : 0.4917          
         Detection Rate : 0.3521          
   Detection Prevalence : 0.5629          
      Balanced Accuracy : 0.6507          
                                          
       'Positive' Class : 0


Here are the files. In the train, do NOT use the first 2k lines.

In the test, the first 100 lines.

UPD====

The files don't fit on the forum; message me privately.

 
mytarmailS:

Yeah... Everything is sadder on the new data (((


Here are the files. In the train, do NOT use the first 2k lines.

In the test, the first 100 lines.

There are no files in the attachment.

I changed the split of the sample into training and validation: for validation I took every 5th row, and got a curious graph.
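The every-5th-row split might look like this in R (a sketch with illustrative names):

val <- seq(5, nrow(d), by = 5)          # every 5th row goes to validation
tr  <- setdiff(seq_len(nrow(d)), val)   # the rest is used for training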

On the OHLC_Exam.csv sample the Accuracy is 0.63.


Along the X axis, each new tree decreases the result, which points to overfitting due to an insufficient number of examples in the sample.

Compress the files with zip.
 
Aleksey Vyazmikin:

There are no files in the attachment.

I changed the split of the sample into training and validation: for validation I took every 5th row, and got a curious graph.

On the OHLC_Exam.csv sample the Accuracy is 0.63.

Along the X axis, each new tree decreases the result, which points to overfitting due to an insufficient number of examples in the sample.

Compress the files with zip.

Yes, yes, our models are overfitted...

Here is a link to download the files; even the compressed file doesn't fit on the forum:

https://dropmefiles.com.ua/56CDZB


Try the model on my features; I wonder what the accuracy will be.