Machine learning in trading: theory, models, practice and algo-trading - page 3334
You have plenty of binary predictors with 0 and 1. They won't split into 32 quanta. But if you normalise them, you may get something with uniform quantisation. If the quanta are non-uniform, then all distances will simply be distorted by the numbers; you need to take absolute values after normalisation.
Yes, with binary ones it's more complicated. But I don't understand how normalisation can help here.
In general, I suppose, the dimensionality needs to be reduced. But then it's not exactly what the authors intended. So far I'm far from an implementation.
There will be an error in prediction if you can't get rid of the noise the way you did in training.
It's a different concept: the data is divided into two parts, "can predict" and "can't predict", and a separate model is responsible for that split. When new data comes in, it is first evaluated whether a prediction should be made at all. Thus predictions are made only on data that was "easily" separable and tightly clustered during training, i.e. showed signs of being reliable.
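To make the idea concrete, here is a minimal sketch of such a two-model setup, assuming scikit-learn; the data, the model choices and the 0.6 gate threshold are all illustrative, not anyone's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Hypothetical data: X - features, y - binary labels (0/1).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# 1) Base model: out-of-fold predictions show which rows it gets right.
base = RandomForestClassifier(n_estimators=200, random_state=0)
oof_pred = cross_val_predict(base, X, y, cv=5)
predictable = (oof_pred == y).astype(int)   # 1 = "can predict", 0 = "can't"

# 2) Gate model: learns to recognise the "can predict" region.
gate = RandomForestClassifier(n_estimators=200, random_state=1)
gate.fit(X, predictable)

# 3) Final base model trained on all data.
base.fit(X, y)

def predict_or_abstain(x_new, threshold=0.6):
    """Predict only where the gate is confident the row is 'predictable'."""
    ok = gate.predict_proba(x_new)[:, 1] >= threshold
    out = np.full(len(x_new), -1)            # -1 = abstain / no trade
    out[ok] = base.predict(x_new[ok])
    return out
```

The gate abstains (returns -1) on rows it considers "unpredictable", which mirrors the "make a prediction or not" step described above.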
It doesn't matter whether it's a tree, a forest or a bush. If the model's prediction is 50%, it means there will be 50% 0's and 50% 1's in the prediction.
That's not the point at all. Forest and boosting use forced tree construction, i.e. there is no algorithm to discard a tree if it turns out lousy. In either case the tree is given a weight. It can be lousy because of excessive randomness in the algorithm, both when selecting features and when selecting examples (subsamples).
No, I haven't. I'll see what it is tonight.
That's right - it's a way of isolating examples that degrade learning - that's the theory.
The idea is to train 100 models and see which examples on average "hinder" reliable classification, and then try to detect them with another model.
So I took the model and looked at its leaves. The model is unbalanced, with only 12.2% of class "1" examples, and 17k leaves.
I marked the leaves up into classes: if the share of responses with target "1" in a leaf was higher than the initial value of 12.2%, then the leaf's class is "1", otherwise "0". The idea of a leaf class here is to have useful information for improving classification.
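A rough sketch of that leaf markup, assuming the per-leaf row indices have already been extracted from the model (leaf_rows and y are hypothetical inputs; 0.122 is the base rate quoted above):

```python
import numpy as np

def label_leaves(leaf_rows, y, base_rate=0.122):
    """Assign each leaf class '1' if its share of target-1 rows
    exceeds the sample base rate, otherwise class '0'.

    leaf_rows: dict {leaf_id: array of row indices that land in the leaf}
    y:         array of binary targets for the training sample
    """
    leaf_class = {}
    for leaf_id, rows in leaf_rows.items():
        share_ones = y[rows].mean() if len(rows) else 0.0
        leaf_class[leaf_id] = 1 if share_ones > base_rate else 0
    return leaf_class
```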
In the histogram we see the values in the leaves of the model (X) and their share in the model in % (Y), without splitting them by class.
And here it's the same, but only for leaves of class "0".
And only for leaves of class "1".
These coefficients in the leaves are summed and passed through the sigmoid (inverse logit), so a "+" sign increases the probability of class "1" and a "-" sign decreases it. Overall the breakdown by class looks valid, but there is bias in the model.
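For reference, a minimal illustration of that transform (the leaf values below are made up):

```python
import numpy as np

def leaf_sum_to_probability(leaf_values):
    """Sum the leaf values picked up by one example and map the raw
    score to a class-'1' probability with the sigmoid (inverse logit)."""
    raw_score = np.sum(leaf_values)
    return 1.0 / (1.0 + np.exp(-raw_score))

# Positive leaf values push the probability above 0.5, negative ones below.
print(leaf_sum_to_probability([0.8, -0.3, 0.1]))   # ~0.65
print(leaf_sum_to_probability([-0.8, 0.3, -0.1]))  # ~0.35
```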
Now we can look at the distribution of the share of correct answers (i.e. classification accuracy), separately for leaves with "1" and with "0".
The histogram for "0" shows a huge number of leaves with accuracy near 100%.
And here there is a larger cluster near the initial split value, i.e. there are a lot of low-information leaves, but at the same time there are also some near 100%.
Looking at the Recall it becomes clear that these leaves are all leaves with a small number of activations - less than 5% of their class.
Recall for class "0"
Recall for class "1".
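A hedged sketch of how such per-leaf precision and recall could be computed, reusing the hypothetical leaf_rows, y and leaf classes from the markup sketch above:

```python
import numpy as np

def leaf_precision_recall(leaf_rows, leaf_class, y):
    """For each leaf: precision = share of rows in the leaf whose target
    matches the leaf class; recall = share of that class's rows (in the
    whole sample) that fall into the leaf."""
    stats = {}
    class_totals = {0: np.sum(y == 0), 1: np.sum(y == 1)}
    for leaf_id, rows in leaf_rows.items():
        c = leaf_class[leaf_id]
        hits = np.sum(y[rows] == c)
        precision = hits / len(rows) if len(rows) else 0.0
        recall = hits / class_totals[c] if class_totals[c] else 0.0
        stats[leaf_id] = (precision, recall)
    return stats
```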
Next we can look at how the weight in the leaf depends on its classification accuracy, again separately for each class.
For target "0"
For target "1".
The presence of linearity, albeit with such a large spread, is noteworthy. But the "column" at 100% accuracy defies the logic, spreading very widely across the range of leaf values.
Maybe this ugliness should be removed?
Also, if we look at the values in the leaves depending on the Recall indicator, we see leaves with a small weight (near 0) that sometimes have a very large number of responses. This situation indicates that the leaf is not good, yet a weight is still attached to it. So can these leaves also be considered noise and zeroed out?
For target "0."
For target "1."
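One way to experiment with that zeroing idea, as a sketch: drop the values of leaves whose precision or recall falls below some cut-off and re-score the sample. leaf_values is a hypothetical {leaf_id: value} map, and the thresholds are placeholders rather than values taken from these posts:

```python
def zero_noisy_leaves(leaf_values, stats, min_precision=0.55, min_recall=0.0005):
    """Return a copy of the leaf values with 'noisy' leaves set to 0.
    stats: {leaf_id: (precision, recall)} from leaf_precision_recall()."""
    cleaned = dict(leaf_values)
    for leaf_id, (precision, recall) in stats.items():
        if precision < min_precision or recall < min_recall:
            cleaned[leaf_id] = 0.0
    return cleaned
```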
I wonder what percentage of leaves on the new sample (not train) will "change" their class?
And in addition, a classic: the interdependence of recall (completeness) and precision (accuracy).
Class 0.
Class "1".
Anyway, I'm thinking about how to weigh that....
And this is what the model looks like in terms of probabilities.
On the train sample as much as 35% profit starts to be made - like in a fairy tale!
On the test sample, on the range from 0.2 to 0.25, we lose a fat chunk of profit - the points of the class maximums get mixed up.
On the exam sample it still earns, but the model is already degrading.
That's right - it's a way of highlighting examples that degrade learning - that's in theory.
The idea is to train 100 models and see which examples on average "interfere" with reliable classification, and then try to detect them with a different model.
Divide the main train set into 5-10 subtrains, each of which you split into a train and a validation part. Train on each in CV fashion, then predict on the whole main train set. Compare the original labels with the labels predicted by all the models. The examples that were not guessed correctly go on a blacklist. Then, when training the final model, remove all bad examples by computing, for each sample, how often on average it was misclassified. Optionally, you can teach a second model to separate the white samples from the black ones, or handle them through a third class.
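A minimal sketch of this procedure, assuming scikit-learn and numpy arrays X, y; the number of splits, the model and the blacklist threshold are illustrative choices, not prescriptions from the post:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

def build_blacklist(X, y, n_models=10, threshold=0.5, seed=0):
    """Train n_models on different subtrains, predict the whole main
    train set with each, and blacklist samples that are misclassified
    more often than `threshold` on average."""
    miss_counts = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_models, shuffle=True, random_state=seed)
    for train_idx, _ in skf.split(X, y):
        model = GradientBoostingClassifier(random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X)             # predict on the whole main train
        miss_counts += (pred != y)
    avg_miss = miss_counts / n_models       # average "not guessed" rate per sample
    return avg_miss > threshold             # True = blacklisted

# Usage: drop blacklisted rows before fitting the final model.
# black = build_blacklist(X, y)
# final_model.fit(X[~black], y[~black])
```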
Labels (the teacher, the target variable) can NOT be rubbish by definition. The price series is marked up from considerations external to the predictors. Once the labels have been decided on, there is the problem of finding predictors that are relevant to that set of labels. It can easily happen that a set of labels is beautiful, but we cannot find predictors for it and have to look for another set of labels. For example, the marks are ZZ (ZigZag) reversals. Beautiful marks. And how do we find predictors for such labels?
As soon as we start filtering the labels by the predictors, that is super-fitting; that is why everything you show here, market results included, does not work on an external, new file in a natural step-by-step mode.