Machine learning in trading: theory, models, practice and algo-trading - page 3437

 
mytarmailS #:
No, check your own code, it's only 3 lines.

I see...

Max, [March 17, 2024, 16:45:58]:
library(TTR)            # ZigZag
library(zoo)            # rollapplyr
library(randomForest)
library(caret)          # confusionMatrix

# random walk of 10 000 points
price <- rnorm(10000) |> cumsum()

# features: each row is the 10 prices of the window minus the window's last
# price (so the final column is always 0)
X <- price |> rollapplyr(10, \(x) x - x[10])

# target: sign of the ZigZag move between consecutive bars, trimmed to align
# with the rows of X
Y <- price |> ZigZag(change = 0.1, percent = F) |> diff() |> sign() |> as.factor() |> tail(nrow(X))

tr <- 1:5000
ts <- 5001:nrow(X)

randomForest(Y[tr] ~ ., X[tr, ]) |> predict(X[ts, ]) |> confusionMatrix(Y[ts])

Confusion Matrix and Statistics

          Reference
Prediction   -1    1
        -1 2431   49
        1    43 2468
                                          
               Accuracy : 0.9816          
                 95% CI : (0.9774, 0.9851)
    No Information Rate : 0.5043          
    P-Value [Acc > NIR] : <2e-16
                                          
                  Kappa : 0.9631          
                                          
 Mcnemar's Test P-Value : 0.6022          
                                          
            Sensitivity : 0.9826          
            Specificity : 0.9805          
         Pos Pred Value : 0.9802          
         Neg Pred Value : 0.9829          
             Prevalence : 0.4957          
         Detection Rate : 0.4871          
   Detection Prevalence : 0.4969          
      Balanced Accuracy : 0.9816          
                                          
       'Positive' Class : -1    

Why does this code give an accuracy of 0.98?

ChatGPT & Midjourney | AI bot, [March 17, 2024, 16:46:02]:
This code gives an accuracy of 0.98 because it uses a random forest model to classify the data. In this case, the model is trained on the data up to index 5000 and tested on the data after that index.

An accuracy of 0.98 means that 98% of all predictions were correct, i.e. the model correctly predicted the class (1 or -1) in 98% of cases.

This is quite a high accuracy, which may be due to good data quality or to the effectiveness of the random forest model for this task.
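
A quick way to verify that number from the matrix above: accuracy is just the share of predictions on the diagonal.

(2431 + 2468) / (2431 + 49 + 43 + 2468)   # 4899 / 4991 ≈ 0.9816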

but the algorithm is trained on random data

Yes, the random forest model does indeed use random subsamples of the data for training. This is done to reduce overfitting and to improve the model's generalisation ability.

Each tree in the random forest is trained on a different subsample of the data, formed by randomly sampling observations with replacement from the training set. In addition, when building each node of a tree, a random subset of features is selected to be considered for the split.

This approach improves the model's generalisation ability and reduces the likelihood of overfitting. So although the model is trained on random subsamples of the data, it can still achieve high accuracy on the test data.
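
Both sources of randomness the bot describes map directly onto randomForest's arguments; a minimal sketch on made-up toy data (not part of the original chat):

library(randomForest)
set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)               # toy features
y <- factor(sample(c(-1, 1), 200, replace = TRUE))  # toy classes

# replace/sampsize control the bootstrap of observations per tree;
# mtry is the number of features tried at each split
rf <- randomForest(x, y, ntree = 100, replace = TRUE, mtry = 2)
rf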
 
mytarmailS #:
No, check your own code, it's only 3 lines.

Bard is brilliant:

The code you provided achieves a high accuracy (0.98) likely due to a few reasons:

  1. Data Leakage:
  • The ZigZag function you use to generate the target variable (Y) utilises information from the entire price series (price). It calculates the difference in price relative to a certain percentage change from the beginning of the series. This leaks information about future price movements into the training data (Y[tr]).
  • Since the random forest model is trained on this "leaky" data, it can easily predict the target variable on unseen data (Y[ts]) that also uses the entire price series for ZigZag calculations. This creates an artificially high accuracy that doesn't reflect real-world performance. (See the sketch after this list.)
  2. Limited Lookback:
  • The model uses a rolling window of 10 past prices (X) to predict the direction of the next price change (Y). In reality, price movements can be influenced by factors beyond the immediate past 10 observations.
  3. Limited Data Size:
  • With only 10000 data points, the model might be overfitting the training data, leading to high accuracy on the training set but potentially poor performance on unseen data.
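
Whatever one makes of the "percentage" claim, the look-ahead point itself is easy to verify: recompute ZigZag on a truncated series and compare it with the full-series values at the same positions. A minimal sketch (toy random walk, same TTR call as above):

library(TTR)
set.seed(1)
p <- cumsum(rnorm(6000))
zz_full <- ZigZag(p, change = 0.1, percent = FALSE)
zz_part <- ZigZag(p[1:5000], change = 0.1, percent = FALSE)

# differences near the cut are typically nonzero: ZigZag at bar t
# depends on pivots that only form after bar t
tail(zz_part) - tail(zz_full[1:5000])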

Here's how the code could be improved for a more realistic evaluation:

  • Separate ZigZag Calculations: Calculate the ZigZag values for the entire price series only after the training and testing splits are defined. This ensures the model doesn't have access to future price information during training.
  • Increase Lookback Window: Experiment with larger window sizes in the rollapplyr function to capture a broader context for predicting price movements.
  • Use Proper Evaluation: Implement cross-validation techniques like k-fold cross-validation to get a more robust estimate of the model's generalisability (a minimal example follows this list).
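
A minimal sketch of such an evaluation with caret, reusing X, Y and tr from the code above (column names are added because caret's train() requires named predictors; for time series, the "timeslice" method keeps folds chronological instead of shuffling past and future together):

library(caret)

colnames(X) <- paste0("x", 1:ncol(X))   # train() wants named predictors

# chronological resampling instead of shuffled k-fold
ctrl <- trainControl(method = "timeslice",
                     initialWindow = 3000, horizon = 500, fixedWindow = TRUE)

fit <- train(x = X[tr, ], y = Y[tr], method = "rf", trControl = ctrl)
fit$results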

By addressing these limitations, you can obtain a more realistic accuracy score that reflects the model's ability to predict price direction on unseen data.

 
Maxim Dmitrievsky #:

Bard is brilliant:

Brilliant isn't the word for it: he makes up believable nonsense with total confidence.

Y <- price |> ZigZag(change = 0.1,percent = F) |> diff() |> sign() |> as.factor() |> tail(nrow(X)) 

and it "calculates percentages", does it :) (the call has percent = F)

===============


The problem was that I messed up the target offset: it did look one index into the future, although it was not an obvious error at all...
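
The off-by-one is easy to see on a toy vector: without the padding, tail() pairs each feature row with the move into its last bar (which the features already contain); padding with a dummy class first shifts every label one step forward, to the move out of that bar:

zz_diff <- sign(diff(c(1, 2, 3, 2, 1)))  # toy zigzag moves: 1 1 -1 -1
tail(zz_diff, 3)                         #  1 -1 -1 : label t = move INTO bar t
tail(c(zz_diff, 1), 3)                   # -1 -1  1 : label t = move OUT OF bar t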


The corrected code:

library(TTR)
library(zoo)
library(randomForest)
library(caret)

price <- rnorm(10000) |> cumsum()

X <- price |> rollapplyr(10, \(x) x - x[10])

# pad the target with a dummy class for the final bar, whose next move is
# still unknown
add_1 <- function(x) c(x, 1)

# with the padding, tail() aligns label t with the ZigZag move OUT OF bar t
# (the genuine forecasting target) rather than the move INTO bar t, which the
# features already contain
Y <- price |> ZigZag(change = 0.1, percent = F) |> diff() |> sign() |> add_1() |> as.factor() |> tail(nrow(X))

tr <- 1:5000
ts <- 5001:nrow(X)

randomForest(Y[tr] ~ ., X[tr, ]) |> predict(X[ts, ]) |> confusionMatrix(Y[ts])


Confusion Matrix and Statistics

          Reference
Prediction   -1    1
        -1 1161 1126
        1  1330 1374
                                          
               Accuracy : 0.5079          
                 95% CI : (0.4939, 0.5219)
    No Information Rate : 0.5009          
    P-Value [Acc > NIR] : 0.1644          
                                          
                  Kappa : 0.0157          
                                          
 Mcnemar's Test P-Value : 4.2e-05         
                                          
            Sensitivity : 0.4661          
            Specificity : 0.5496          
         Pos Pred Value : 0.5077          
         Neg Pred Value : 0.5081          
             Prevalence : 0.4991          
         Detection Rate : 0.2326          
   Detection Prevalence : 0.4582          
      Balanced Accuracy : 0.5078          
                                          
       'Positive' Class : -1     
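
With the leak removed, the result is indistinguishable from coin-flipping on a random walk, which a quick check of the printed counts confirms:

# 2535 correct out of 4991 test cases; the interval straddles 0.5
binom.test(1161 + 1374, 1161 + 1126 + 1330 + 1374, p = 0.5)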
 
mytarmailS #:

Brilliant isn't the word for it: he makes up believable nonsense with total confidence.

and it "calculates percentages", does it :)

===============


The problem was that I messed up the target offset: it did look one index into the future, although it was not an obvious error at all...

The corrected code:


Well, it was somewhere around there.)

It's been a long time since I used R, so I don't remember what all those symbols mean.
 

Somewhere in a parallel universe, if all DCs (dealing centres) were spread-free :)


 
Maxim Dmitrievsky #:

Somewhere in a parallel universe, if all DCs were spread-free :)

But such trading systems work well with a martingale: you can't beat the spread on every single trade, but on average you can.


 
Maxim Dmitrievsky #:

Somewhere in a parallel universe, if all DCs were spread-free :)


Make your tester enter at the open of the next candle after the signal, not at the close, and you will have 99% less to be delighted about than before.
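
The difference between the two fills is just the close-to-open gap, which a toy sketch makes explicit (made-up prices; in a feed where the next open always equals the current close the difference vanishes, as the reply below notes):

set.seed(42)
cl <- cumsum(rnorm(1000))                          # toy close prices
op <- c(cl[1], head(cl, -1) + rnorm(999, 0, 0.1))  # toy opens: prior close plus a gap

# a signal formed at the close of bar t is filled at cl[t] in an optimistic
# tester, but at op[t+1] with a real order; the difference is the gap
summary(op[2:1000] - cl[1:999])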
 
Maxim Dmitrievsky #:

Somewhere in a parallel universe, if all DCs were spread-free :)


Yep, I have a similar AR model on 2k showing the same thing.

 
mytarmailS #:
Make your tester enter at the open of the next candle after the signal, not at the close, and you will have 99% less to be delighted about than before.

It enters at the close of the current candle, which equals the open of the next candle.

 
Rorschach #:

Yep, I have a similar AR model on 2k showing the same thing.

And the lower the TF, the prettier it is.
