Machine learning in trading: theory, models, practice and algo-trading - page 2126

 
elibrarius:
And what's the point if the spread doesn't cover it?

With duplicates it always overfits; the model residuals are autocorrelated.

I.e. self-deception. See the picture from the previous post.
 
Maxim Dmitrievsky:

With duplicates it always overfits; the model residuals are autocorrelated.

I.e. self-deception. See the picture from the previous post.
A picture without an explanation is just a picture.)
 
elibrarius:
A picture without an explanation is just a picture.)

The loops in the first picture are serial labels, and the model overfits to them, because the seriality on new data is completely different.

The picture is built from the dataset: the relations between the features (the feature space). I have already written about this and posted such screenshots.

 
Maxim Dmitrievsky:

The loops in the first picture are serial labels, and the model overfits to them, because the seriality on new data is completely different.

The picture is built from 5 principal components of the dataset: the relations between the features (the feature space). I have already written about this and posted such screenshots.

If the spread can't be beaten, then this doesn't help much against overfitting.
In my opinion, it is better not to thin the data but to use other ways to combat overfitting.
 
elibrarius:
If the spread can't be beaten, then this doesn't help much against overfitting.
In my opinion, it is better not to thin the data but to use other ways to combat overfitting.
The spread cannot be beaten after simple decorrelation, but the model is more stable on new data without the spread. Any model that is overfitted to seriality loses on new data even without the spread, yet does much better on the training set than the first one (it also loses with the spread). This clearly shows overfitting to seriality and nothing else. I know it's hard to understand, but that's how it is 🤣

If you look at the pictures again, you'll see higher distribution peaks, and maybe tails, in the first one. That is seriality, volatility, whatever you call it. It changes almost immediately on new data, hence the overfit. The second, bottom picture doesn't have that; it shows all that is left, and in that garbage you have to look for an alpha that beats the spread.

Just look at your data and at least remove the seriality, or transform it somehow to remove the tails. Then look at the class distributions of what's left: whether there are proper cluster groups or complete randomness like mine. That way you can even see visually whether the dataset is workable or garbage. After that you can mix validation with train; it won't affect anything. And you say "just a picture".
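One way to see where such seriality can come from is a sketch on synthetic data. It assumes (this is not stated in the thread) that the targets are built over a forward window of several bars, so that neighbouring samples share most of their bars. The names `forward_sums` and `HORIZON` are made up for the illustration. Thinning to one sample per window removes the shared bars, which is roughly what "simple decorrelation" could look like:

```python
import random

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var

def forward_sums(increments, horizon, step):
    """Sum of the next `horizon` increments, sampled every `step` bars."""
    last = len(increments) - horizon
    return [sum(increments[i:i + horizon]) for i in range(0, last + 1, step)]

rng = random.Random(0)
# Synthetic, independent price changes: an assumption for the demo, not real quotes.
increments = [rng.gauss(0.0, 1.0) for _ in range(20000)]
HORIZON = 10  # hypothetical target horizon in bars

overlapping = forward_sums(increments, HORIZON, step=1)    # neighbours share 9 of 10 bars
thinned = forward_sums(increments, HORIZON, step=HORIZON)  # neighbours share no bars

ac_overlapping = lag1_autocorr(overlapping)
ac_thinned = lag1_autocorr(thinned)
print(f"overlapping: {ac_overlapping:.2f}, thinned: {ac_thinned:.2f}")
```

Even though the underlying increments are independent, the overlapping targets are strongly autocorrelated, while the thinned ones are not. Whether this matches the pictures discussed above is a guess.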
 
elibrarius:

// day of week, hour = feed each as 2 predictors: sin and cos of the angle over the full cycle, 360/7 and 360/24

if(nameInd[nInd]=="Hour")
  {
   // minutes are added for precision: 360/24 degrees per hour becomes 360/(24*60) = 360/1440 degrees per minute
   CopyTime(sim,per,startDt,n_bar+1,dtm);
   TimeToStruct(dtm[0],dts);
   ArrayResize(tmp,1);
   tmp[0]=(double)(dts.hour*60+dts.min)*360.0/1440.0; // angle in degrees within the day
   tmp[0]=(buf==0?MathSin(tmp[0]*pi/180.0):MathCos(tmp[0]*pi/180.0)); // buf 0 -> sine, otherwise cosine
  }

if(nameInd[nInd]=="WeekDay")
  {
   // hours and minutes are added for precision: 360/7 degrees per day becomes 360/(7*24*60) = 360/10080 degrees per minute
   CopyTime(sim,per,startDt,n_bar+1,dtm);
   TimeToStruct(dtm[0],dts);
   ArrayResize(tmp,1);
   tmp[0]=(double)(dts.day_of_week*1440+dts.hour*60+dts.min)*360.0/10080.0; // angle in degrees within the week
   tmp[0]=(buf==0?MathSin(tmp[0]*pi/180.0):MathCos(tmp[0]*pi/180.0)); // buf 0 -> sine, otherwise cosine
  }

In the code, buf==0 gives the sine; otherwise (buf==1) it gives the cosine.


Tree-based models digest anything.
Sine and cosine are good for neural networks because they are already normalized to -1...+1.
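For those who prepare features outside the terminal, the same encoding can be sketched in Python. This is an illustration, not a port of the exact code above: `encode_cycle` and `time_features` are made-up names, and the input is assumed to be an ordinary `datetime`.

```python
import math
from datetime import datetime

MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY

def encode_cycle(position, period):
    """Map a position within a cycle to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * position / period
    return math.sin(angle), math.cos(angle)

def time_features(ts):
    """Return (hour_sin, hour_cos, weekday_sin, weekday_cos) for a datetime."""
    minute_of_day = ts.hour * 60 + ts.minute
    # MQL5's day_of_week counts from Sunday = 0; isoweekday() % 7 gives the same numbering.
    day_of_week = ts.isoweekday() % 7
    minute_of_week = day_of_week * MINUTES_PER_DAY + minute_of_day
    return (encode_cycle(minute_of_day, MINUTES_PER_DAY)
            + encode_cycle(minute_of_week, MINUTES_PER_WEEK))

late = time_features(datetime(2021, 1, 4, 23, 59))
midnight = time_features(datetime(2021, 1, 5, 0, 0))
print(late[:2], midnight[:2])
```

The point of the pair is visible in the last lines: 23:59 and 00:00 end up next to each other, whereas as plain hour numbers they would be at opposite ends of the scale.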

If you compare this variant with plain numbered time, tell me which is better. It seems to me the results should match 100% if you feed in the day of the week, the hour and the minute.
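On the information side this can be checked directly. The sketch below (with made-up helper names `encode` and `decode`) shows that the sine and cosine pair together identify the minute of the week exactly, while the sine alone does not. It says nothing about whether a particular model will produce identical results on both representations; that still has to be tested.

```python
import math

MINUTES_PER_WEEK = 7 * 24 * 60

def encode(minute_of_week):
    """Return the (sin, cos) pair for a minute of the week."""
    angle = 2.0 * math.pi * minute_of_week / MINUTES_PER_WEEK
    return math.sin(angle), math.cos(angle)

def decode(sin_value, cos_value):
    """Recover the minute of the week from its (sin, cos) pair."""
    angle = math.atan2(sin_value, cos_value) % (2.0 * math.pi)
    return round(angle * MINUTES_PER_WEEK / (2.0 * math.pi)) % MINUTES_PER_WEEK

# Every minute of the week survives the round trip through (sin, cos).
recovered_all = all(decode(*encode(m)) == m for m in range(MINUTES_PER_WEEK))

# The sine alone is ambiguous: two different minutes share the same value.
ambiguous = math.isclose(encode(1000)[0], encode(MINUTES_PER_WEEK // 2 - 1000)[0])

print(recovered_all, ambiguous)
```

So both columns are needed; with only one of them, two different times of the week become indistinguishable.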

I don't quite get it: is the choice between sine and cosine left to the user?

As for pi: did you take it from a library, or is it just given to a certain number of digits, and how many? Better to write the constant you actually set here.

 
Aleksey Vyazmikin:

I don't quite get it: is the choice between sine and cosine left to the user?

As for pi: did you take it from a library, or is it just given to a certain number of digits, and how many? Better to write the constant you actually set here.

Two columns go into the model for the hour: both the sine and the cosine. The same goes for the day of the week: sine + cosine. See the link for a description of why this should be done.

pi = 3.141592 ... from school
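How much the precision of a hand-typed constant matters can be estimated quickly. The sketch below compares the hour sine computed with pi rounded to six decimals against the library value; `PI_6_DIGITS` and `hour_sin` are names invented for the check. In MQL5 itself the built-in constant `M_PI` avoids the question altogether.

```python
import math

PI_6_DIGITS = 3.141593  # pi rounded to six decimals, as one might type it by hand

def hour_sin(minute_of_day, pi_value):
    """Sine of the time-of-day angle, using the given value of pi."""
    degrees = minute_of_day * 360.0 / 1440.0
    return math.sin(degrees * pi_value / 180.0)

# Largest deviation over every minute of the day.
max_error = max(
    abs(hour_sin(m, PI_6_DIGITS) - hour_sin(m, math.pi))
    for m in range(1440)
)
print(f"max error: {max_error:.2e}")
```

With six correct decimals the feature values differ by less than one millionth, which is far below anything a model would react to. A mistyped digit is a different matter, so a library constant is the safer choice.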

 

The book discussed above makes me aware of how little mathematics I know; if anyone can read it fluently, I envy them.

The question is: what is the best way to describe, with one or two numbers, a process that repeats periodically at varying intervals? The process has a high repetition rate within some dense group; then the frequency fades, and there may be no signal at all for 15% of the observation interval. The goal is to check that there is no critical crowding (say 70%) in any one part of the observation period with too little signal in the other intervals. That is, the closer to a uniform distribution the better, although the signal itself is by nature far from uniformly distributed (I think).
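One possible pair of numbers, offered only as a sketch and not as an answer from the book: split the observation period into equal bins, then report the normalized entropy of the signal counts (1.0 means perfectly even, values near 0 mean everything sits in one bin) and the share of the fullest bin (which shows the crowding directly). The function name `evenness`, the bin count and the sample data are all made up.

```python
import math

def evenness(event_times, start, end, n_bins=20):
    """Return (normalized entropy, largest bin share) of events over [start, end)."""
    counts = [0] * n_bins
    width = (end - start) / n_bins
    for t in event_times:
        if start <= t < end:
            counts[min(int((t - start) / width), n_bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        return 0.0, 0.0
    shares = [c / total for c in counts]
    entropy = -sum(p * math.log(p) for p in shares if p > 0)
    return entropy / math.log(n_bins), max(shares)

# Synthetic examples: an even stream, and one with 70% of signals crowded at the start.
uniform = [i + 0.5 for i in range(100)]
clustered = [i * 0.05 for i in range(70)] + [10.0 + 3.0 * i for i in range(30)]

print(evenness(uniform, 0.0, 100.0))
print(evenness(clustered, 0.0, 100.0))
```

The second number answers the "is there 70% crowding anywhere" question directly, and the first summarizes how far the whole picture is from uniform. Both depend on the chosen number of bins, so it has to be fixed in advance when comparing different signals.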

 
Aleksey Vyazmikin:

I don't quite get it: is the choice between sine and cosine left to the user?

As for pi: did you take it from a library, or is it just given to a certain number of digits, and how many? Better to write the constant you actually set here.

You have CatBoost 😑 just mark it as categorical.