Machine learning in trading: theory, models, practice and algo-trading - page 57

 
Yury Reshetov:
Well, stability is achieved by preventing potential overfitting. An unbalanced training sample is a potential cause of overfitting on the under-represented classes. After all, the learning algorithm acts as it sees fit rather than as it must in order to improve generalization. If the sample is unbalanced, it will minimize training error on the least represented classes by simply memorizing their few examples instead of generalizing. After such rote learning it is no surprise that, outside the training sample, the algorithm's errors fall most often on the under-represented classes.
I have nothing against balancing the training sample. I am against taking a random sub-sample for validation without splitting by dates: that overestimates the metric on validation.
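
For readers who want to see what "balancing the training sample" could look like in practice, here is a minimal sketch in Python. It assumes a pandas DataFrame `train` with a class column `label` (both names are hypothetical) and simply undersamples every class down to the size of the rarest one; only the training part is balanced, the out-of-sample data is left untouched.

```python
# Minimal sketch: balance a training sample by undersampling every class
# down to the size of the rarest one. `train` and `label` are hypothetical.
import pandas as pd

def balance_by_undersampling(train: pd.DataFrame, label: str = "label",
                             seed: int = 42) -> pd.DataFrame:
    counts = train[label].value_counts()
    n_min = counts.min()  # size of the rarest class
    parts = [train[train[label] == cls].sample(n=n_min, random_state=seed)
             for cls in counts.index]
    # equally sized class subsets, concatenated and shuffled
    return (pd.concat(parts)
              .sample(frac=1, random_state=seed)
              .reset_index(drop=True))
```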
 
Alexey Burnakov:

The point is simple. In real life no one will let you assess the quality of real trading on a mixed sample that contains observations from the future. All observations will come after day X.

Hence, by taking a mixed sample for validation (without separating by dates), you overestimate the quality metric on validation. That's it. Unpleasant surprises will follow.
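
To make the point concrete, here is a minimal sketch of a strict out-of-time split, assuming a pandas DataFrame `df` with a `date` column (names are hypothetical): everything up to day X is available for training, everything after it is reserved for validation, so no "future" rows leak into the metric.

```python
# Minimal sketch of a strict out-of-time split: rows up to the cutoff date
# form the training part, rows after it form the validation part.
import pandas as pd

def split_by_date(df: pd.DataFrame, cutoff: str):
    df = df.sort_values("date")
    train = df[df["date"] <= pd.Timestamp(cutoff)]
    valid = df[df["date"] > pd.Timestamp(cutoff)]
    return train, valid

# hypothetical usage:
# train, valid = split_by_date(df, cutoff="2015-06-30")
```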

What does thinking have to do with it? Thinking is subjective. It can be right or it can be plainly wrong, because imagination is limited. The criterion of truth is always experience.

Take two training samples, one balanced in advance and the other highly unbalanced. Train the algorithm on both and measure the generalization ability on the test parts. Then compare them: whichever sample gives the better generalization ability wins. That comparison is the criterion of truth.
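
A rough sketch of this experiment might look as follows, using scikit-learn with logistic regression as a stand-in for whatever learner is actually used; the DataFrames and column names are assumptions, and balanced accuracy is used so the rare class is not ignored in the comparison.

```python
# Sketch of the experiment: train the same model on a balanced and on an
# unbalanced training sample, then compare both on one held-out test part.
# scikit-learn / logistic regression are stand-ins for any learner.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def compare_balancing(train_unbal, train_bal, test, features, label="label"):
    scores = {}
    for name, train in (("unbalanced", train_unbal), ("balanced", train_bal)):
        model = LogisticRegression(max_iter=1000)
        model.fit(train[features], train[label])
        pred = model.predict(test[features])
        # balanced accuracy weights each class equally, so a model that
        # ignores the rare class is penalized
        scores[name] = balanced_accuracy_score(test[label], pred)
    return scores
```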

Otherwise we will think and speculate until we drop. After all, argument breeds disagreement, while truth is born of experience.

So I will stop further discussion of balancing the training sample here. Otherwise this hullabaloo can go on indefinitely: there are two different opinions, and measuring which of us thinks more correctly is a waste of time.

 
Yury Reshetov:
Well, stability is achieved by preventing potential overfitting. An unbalanced training sample is a potential cause of overfitting on the under-represented classes. After all, the learning algorithm acts as it sees fit rather than as it must in order to improve generalization. If the sample is unbalanced, it will minimize training error on the least represented classes by simply memorizing their few examples instead of generalizing. After such rote learning it is no surprise that, outside the training sample, the algorithm's errors fall most often on the under-represented classes.

1. On unbalanced classes you get goodness knows what: the error can differ between classes by several times. Which one is the right one?

2. It is far from always possible to balance classes.

Take your BUY/SELL example. With more than 3000 observations (bars), the imbalance is 10%, at most 20%. That is quite possible to balance.

And here Dr. Trader has suggested the target variable "pivot / not pivot". I think he derived it from ZZ (ZigZag). With such a target variable the classes are unbalanced by orders of magnitude. If we oversample the minority class up to the size of the majority class, can we train on such a "balanced" sample? It seems to me that we cannot.

So it's not that simple with balancing.

From my own experience:

  • If the imbalance is small (no more than 20%), balancing should be mandatory.
  • If the imbalance is large (several times over), you cannot balance it, and this target variable should be abandoned (see the sketch after this post).

I could not find other solutions.
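
As a purely illustrative sketch of this rule of thumb (the 20% threshold comes from the post itself; the helper name and return values are assumptions, not a fixed recipe):

```python
# Illustrative sketch of the rule of thumb above.
import pandas as pd

def assess_imbalance(y: pd.Series, mild_threshold: float = 0.20) -> str:
    counts = y.value_counts()
    imbalance = 1.0 - counts.min() / counts.max()  # 0.0 means perfectly balanced
    if imbalance <= mild_threshold:
        return "balance"         # mild skew: balance the training sample and train
    return "abandon_target"      # skew of several times: rethink the target variable
```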

 
Yury Reshetov:
I'm going to drop everything and become an adept of R so that I can juggle numbers with a serious look on my face.
So far I haven't noticed you juggling numbers, and as for your face, I can't tell: I can't see it.
 
SanSanych Fomenko:
So far I haven't noticed you juggling numbers, and as for your face, I can't tell: I can't see it.
Well, my face looks quite serious in my avatar, doesn't it? At least I tried very hard to make it as serious as possible. Apparently it didn't turn out very well?
 
Yury Reshetov:

What does thinking have to do with it? Thinking is subjective. It can be right or it can be plainly wrong, because imagination is limited. The criterion of truth is always experience.

Take two training samples, one balanced in advance and the other highly unbalanced. Train the algorithm on both and measure the generalization ability on the test parts. Then compare them: whichever sample gives the better generalization ability wins. That comparison is the criterion of truth.

Otherwise we will think and speculate until we drop. After all, argument breeds disagreement, while truth is born of experience.

So I will stop further discussion of balancing the training sample here. Otherwise this hullabaloo can go on indefinitely: there are two different opinions, and measuring which of us thinks more correctly is a waste of time.

I mean one thing, you mean another. I say we should divide the set strictly by dates. And you're talking about balance.
 
Alexey Burnakov:
I mean one thing, you mean another. I say we should divide the set strictly by dates. And you speak about balance.

I'm sorry, but I've already said that I see no point in continuing this squabble. I have already tried to explain, with examples, the drawbacks of an unbalanced sample. I guess I wasn't very convincing? Well, I'm no good at black rhetoric, so I can't pass off black as white with a serious look on my face. So don't judge me too harshly.

Most likely the reason is that you think I am trying to "force" you to balance reality? I have no such intention. I know that reality, unfortunately for me, is often unbalanced, and the means to balance it are not always available. So in my posts I tried to explain that there is no need to try to balance reality outside the training sample; it is necessary and sufficient to balance only the training sample, so that the model obtained from it is not skewed toward the heavily represented classes. When the overall sample is divided into parts by dates, it is also often impossible to achieve balance. That is why I balance the training sample not by dates, but by equal representation of the classes in it.

I will not answer any more questions about balancing the training sample. This hullabaloo has dragged on long enough already.

 
Yury Reshetov:

I'm sorry, but I've already said that I see no point in continuing this squabble. I have already tried to explain, with examples, the drawbacks of an unbalanced sample. I guess I wasn't very convincing? Well, I'm no good at black rhetoric, so I can't pass off black as white with a serious look on my face. So don't judge me too harshly.

Most likely the reason is that you think I am trying to "force" you to balance reality? I have no such intention. I know that reality, unfortunately for me, is often unbalanced, and the means to balance it are not always available. So in my posts I tried to explain that there is no need to try to balance reality outside the training sample; it is necessary and sufficient to balance only the training sample, so that the model obtained from it is not skewed toward the heavily represented classes. When the overall sample is divided into parts by dates, it is also often impossible to achieve balance. That is why I balance the training sample not by dates, but by equal representation of the classes in it.

I will not answer any more questions about balancing the training sample. This hullabaloo has dragged on long enough already.

Okay. I will not convince you.
 

I want to interject, to complete the picture, and repeat the opinion I expressed earlier in this thread.

1. It is necessary to have two sets of data: the second is an extension of the first in time.

2. Balance the first data set. We definitely balance it.

3. We randomly divide the first data set into three parts: training, testing and validation.

  • We train the model on the training set using cross-validation.
  • We run the trained model through the test and validation sets.
  • If the error on all three sets is approximately equal, we proceed to step 4. Otherwise we go back to searching for better predictors, since a considerable difference in error shows that the model is overfitted because of noise predictors (ones weakly related to the target variable).

4. We obtain an error on the second set, which is a continuation of the first set in time.

If the error on all FOUR sets is about the same, then the model is not overfitted. If that error is also acceptably small, we can safely move on, i.e. run the model through the tester.

If there is a significant difference (more than 30%), then the original set of predictors leads to overfitting of the model, and from personal experience changing the model type fixes nothing as far as overfitting is concerned. We need to get rid of the noise predictors. It may easily turn out that the available set contains nothing but noise predictors.
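
Here is one possible sketch of the whole procedure in Python with scikit-learn; the DataFrame, column names, cutoff date and the random-forest learner are all assumptions standing in for whatever data and model are actually used.

```python
# Sketch of the procedure above. Assumptions: a time-ordered DataFrame `df`
# with a `date` column, a list of feature columns `features`, a class column
# `label`, and a random forest as a stand-in for any model.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

def run_procedure(df, features, label="label", cutoff="2015-01-01", seed=42):
    # step 1: two sets, the second continues the first in time
    df = df.sort_values("date")
    set1 = df[df["date"] <= pd.Timestamp(cutoff)]
    set2 = df[df["date"] > pd.Timestamp(cutoff)]

    # step 2: balance set 1 by undersampling to the rarest class
    n_min = set1[label].value_counts().min()
    set1 = (set1.groupby(label, group_keys=False)
                .apply(lambda g: g.sample(n=n_min, random_state=seed)))

    # step 3: random split of set 1 into training / test / validation parts
    train, rest = train_test_split(set1, test_size=0.4, random_state=seed,
                                   stratify=set1[label])
    test, valid = train_test_split(rest, test_size=0.5, random_state=seed,
                                   stratify=rest[label])

    model = RandomForestClassifier(random_state=seed)
    cv_error = 1 - cross_val_score(model, train[features], train[label], cv=5).mean()
    model.fit(train[features], train[label])

    errors = {"cross_validation": cv_error}
    for name, part in (("test", test), ("valid", valid), ("set2", set2)):
        # step 4 for set2: error on the later, out-of-time continuation
        errors[name] = 1 - model.score(part[features], part[label])

    # roughly equal errors on all four parts suggest the model is not
    # overfitted; a large gap points to noise predictors
    return errors
```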

 
I will join your conversation, gentlemen, because I have been using Yury's optimizer for more than a year and I completely agree with him about sampling with a PRNG (pseudo-random number generator). The task is to identify, in the input data, information about the output. In other words, the optimizer tells you how informative the input data are with respect to our (ideal) output; that is the question it answers. If the data show a poor result, they carry no information about the output, or rather they carry only as much as the level of generalization the predictor reports.

Now imagine a case where we have 10 inputs. The question is: how many records (rows) do we need in order to fit the sample down to zero error? I'll give you a hint: 100 records with 10 inputs should optimize down to zero, because with 100 records an exhaustive search of the data is performed. Maybe I am not expressing myself clearly, I apologize. Yury certainly does not talk about this, but there is a nuance in using the predictor, not advertised, that increases the generalizability of any data. That is, with 10 inputs and 100 rows, even on data completely unrelated to the system, the algorithm will build a model whose generalizability is high, in the range of 90% or above. It is by no means certain that such a model will work adequately in the future, because the data were taken out of thin air and have nothing to do with the system. The predictor is simply able to partition a multidimensional space with minimal error, although for that you need to perform one tricky data manipulation.

Still, I completely agree with Yury: the task is to identify how informative the inputs are with respect to the output, and the order of the data plays no role here. A PRNG is one way to do that....
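
To illustrate the memorization effect described above with a generic learner (not the particular optimizer discussed in the thread), here is a small sketch: 100 random rows with 10 random inputs can be fitted almost perfectly in-sample, while out-of-sample accuracy stays near chance.

```python
# Small illustration: with 10 inputs and only ~100 rows, a flexible model
# can "memorize" pure noise. A scikit-learn random forest is used as a
# generic flexible learner; all data below are random.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))    # 100 rows, 10 inputs of pure noise
y_train = rng.integers(0, 2, size=100)  # random binary target
X_test = rng.normal(size=(1000, 10))
y_test = rng.integers(0, 2, size=1000)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("in-sample accuracy:", model.score(X_train, y_train))    # close to 1.0
print("out-of-sample accuracy:", model.score(X_test, y_test))  # close to 0.5
```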