Machine learning in trading: theory, models, practice and algo-trading - page 2

 

I will clarify the conditions for the cash prize:

5 credits will go to the first one to solve the problem

Deadline for solutions: June 30, 2016.

 

Here is an example of applying the algorithm for selecting informative features to build a trading strategy.

You probably read my blogs about the Big Experiment: https://www.mql5.com/ru/blogs/post/661895

And here is such a picture:

I tried to find a single pattern for five pairs and to measure the percentage of correctly guessed deals on a validation sample covering about 25 years. It did not work right away: I did not reach the desired accuracy for any forecast horizon.

Next, let's take just one pair, EURUSD. I found a dependence of the price movement 3 hours ahead on a subset of my predictors.

I reduced the predictors to categorical form and ran my function that searches for significant predictors. I did it just now, while at work, in 20 minutes.
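The categorical reduction itself can be sketched roughly like this (a minimal sketch, assuming the infotheo package and that output is the last column of sampleA; the significance-search function is the author's own and is not shown):

# Minimal sketch: turn numeric predictors into 3-level categorical ones with
# equal-frequency binning; 'sampleA' and its column layout are assumed.
library(infotheo)

disc_levels <- 3
sampleA_cat <- as.data.frame(lapply(sampleA[, -ncol(sampleA)], function(col)
  discretize(col, disc = "equalfreq", nbins = disc_levels)[, 1]))
sampleA_cat$output <- sampleA$output   # keep the binary target as-is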

[1] "1.51%"

> final_vector <- c((sao$par >= threshold), T)

> names(sampleA)[final_vector]

[1] "lag_diff_45_var"      "lag_diff_128_var"     "lag_max_diff_8_var"   "lag_max_diff_11_var"  "lag_max_diff_724_var" "lag_sd_362_var"      

[7] "output"   

Convergence was not that fast, but I got a result at the level of about one and a half percent explanatory power.

The convergence graph (minimization).

Next is the construction of the model.

We have several categorical predictors. We build a "rule book": which combinations of predictor levels correspond to which output - long or short on the 3-hour horizon.

This is what the result looks like:

predictor levels   sell   buy   pval       concat   direction
121121             11     31    2.03E-03   121121   1
211112             3      15    4.68E-03   211112   1
222222             19     4     1.76E-03   222222   0
222311             8      0     4.68E-03   222311   0
321113             7      0     8.15E-03   321113   0
333332             53     19    6.15E-05   333332   0

Each line shows the imbalance between the number of buys and sells and the corresponding p-value of the chi-square test against a 50/50 distribution. We keep only those lines where the p-value is below 0.01.
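As a check, the p-value in the first row of the table (11 sells versus 31 buys) can be reproduced directly:

# Reproducing the p-value for the first pattern in the table above:
# 11 sells vs. 31 buys, tested against an even 50/50 split.
chisq.test(x = c(11, 31), p = c(0.5, 0.5))$p.value
# ~0.00203, matching the 2.03E-03 shown for pattern 121121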

And here is the code of the whole experiment, starting from the point where the inputs have already been selected:

library(infotheo)  # discretize() used below is assumed to come from this package

# training sample: the selected categorical predictors plus the binary output
dat_test <- sampleA[, c("lag_diff_45_var"
                        , "lag_diff_128_var"
                        , "lag_max_diff_8_var"
                        , "lag_max_diff_11_var"
                        , "lag_max_diff_724_var"
                        , "lag_sd_362_var"
                        , "output")]

# concatenate the predictor levels of each observation into one pattern string
dat_test$concat <- do.call(paste0, dat_test[1:(ncol(dat_test) - 1)])

# count sells (column "0") and buys (column "1") for every pattern
x <- as.data.frame.matrix(table(dat_test$concat, dat_test$output))

# chi-square test of each pattern's buy/sell counts against a 50/50 split
x$pval <- NA
for (i in 1:nrow(x)) {
  x$pval[i] <- chisq.test(x = c(x$`0`[i], x$`1`[i]), p = c(0.5, 0.5))$p.value
}

# the "rule book": only patterns with a significantly skewed buy/sell ratio
trained_model <- subset(x, x$pval < 0.01)

trained_model$concat <- rownames(trained_model)

# trade direction per pattern: 1 = buy, 0 = sell
trained_model$direction <- NA
trained_model$direction[trained_model$`1` > trained_model$`0`] <- 1
trained_model$direction[trained_model$`0` > trained_model$`1`] <- 0

### test model

load('C:/Users/aburnakov/Documents/Private/big_experiment/many_test_samples.R')

many_test_samples_eurusd_categorical <- list()

# discretize every raw test sample into 3-level categorical predictors
for (j in 1:49) {
  dat <- many_test_samples[[j]][, c(1:108, 122)]
  disc_levels <- 3
  for (i in 1:108) {
    naming <- paste(names(dat[i]), 'var', sep = "_")
    dat[, eval(naming)] <- discretize(dat[, eval(names(dat[i]))], disc = "equalfreq", nbins = disc_levels)[, 1]
  }
  # binary output: the sign of the future move (~3 hours ahead)
  dat$output <- NA
  dat$output[dat$future_lag_181 > 0] <- 1
  dat$output[dat$future_lag_181 < 0] <- 0
  many_test_samples_eurusd_categorical[[j]] <- subset(dat, is.na(dat$output) == F)[, 110:218]
  many_test_samples_eurusd_categorical[[j]] <- many_test_samples_eurusd_categorical[[j]][(nrow(dat) / 5):(2 * nrow(dat) / 5), ]
}

correct_validation_results <- data.frame()

# apply the rule book to each of the 49 validation samples
for (i in 1:49) {
  dat_valid <- many_test_samples_eurusd_categorical[[i]][, c("lag_diff_45_var"
                                                             , "lag_diff_128_var"
                                                             , "lag_max_diff_8_var"
                                                             , "lag_max_diff_11_var"
                                                             , "lag_max_diff_724_var"
                                                             , "lag_sd_362_var"
                                                             , "output")]
  dat_valid$concat <- do.call(paste0, dat_valid[1:(ncol(dat_valid) - 1)])
  y <- as.data.frame.matrix(table(dat_valid$concat, dat_valid$output))
  y$concat <- rownames(y)

  # keep only the patterns that are present in the trained rule book
  valid_result <- merge(x = y, y = trained_model[, 4:5], by.x = 'concat', by.y = 'concat')
  correct_sell <- sum(subset(valid_result, valid_result$direction == 0)[, 2])
  correct_buys <- sum(subset(valid_result, valid_result$direction == 1)[, 3])
  correct_validation_results[i, 1] <- correct_sell
  correct_validation_results[i, 2] <- correct_buys
  correct_validation_results[i, 3] <- sum(correct_sell, correct_buys)
  correct_validation_results[i, 4] <- sum(valid_result[, 2:3])
  correct_validation_results[i, 5] <- correct_validation_results[i, 3] / correct_validation_results[i, 4]
}

# columns V1..V5 correspond to "correct sell", "correct buy",
# "total correct deals", "total deals", "share correct" in the output below
hist(correct_validation_results$V5, breaks = 10)

plot(correct_validation_results$V5, type = 's')

sum(correct_validation_results$V3) / sum(correct_validation_results$V4)

Next come the 49 validation samples, each covering roughly 5 years. Let's validate the model on them and count the percentage of correctly guessed trade directions.

Here is the percentage of correctly guessed trades per sample and a histogram of that value:

And the total share of correctly guessed trade directions across all samples:

> sum(correct_validation_results$`total correct deals`) / sum(correct_validation_results$`total deals`)

[1] 0.5361318

About 54%. But this does not take into account the fact that we need to overcome the distance between Ask and Bid. That is, the break-even threshold, judging by the numbers above, is about 53%, assuming a spread of 1 point.
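A back-of-the-envelope sketch of where a threshold of that order comes from (my own illustration with an assumed average move size, not the author's exact calculation): if each trade wins or loses about N points and the spread costs s points, the break-even hit rate is 0.5 + s / (2 * N).

# Break-even accuracy under the assumption of symmetric wins/losses of N points
# and a spread of s points per trade; N = 17 is an assumed average 3-hour move.
breakeven <- function(N, s) 0.5 + s / (2 * N)
breakeven(N = 17, s = 1)   # ~0.529, close to the ~53% threshold mentioned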

So in 30 minutes we have put together a simple model that is easy to hard-code in the terminal, for example. And it is not even a committee. And I searched for dependencies for 20 minutes instead of 20 hours. All in all, there is something here.

And all thanks to the correct selection of informative features.

And here are the detailed statistics for each validation sample.

sample  correct sell  correct buy  total correct deals  total deals  share correct
1 37 10 47 85 0.5529412
2 26 7 33 65 0.5076923
3 30 9 39 80 0.4875
4 36 11 47 88 0.5340909
5 33 12 45 90 0.5
6 28 10 38 78 0.4871795
7 30 9 39 75 0.52
8 34 8 42 81 0.5185185
9 24 11 35 67 0.5223881
10 23 14 37 74 0.5
11 28 13 41 88 0.4659091
12 31 13 44 82 0.5365854
13 33 9 42 80 0.525
14 23 7 30 63 0.4761905
15 28 12 40 78 0.5128205
16 23 16 39 72 0.5416667
17 30 13 43 74 0.5810811
18 38 8 46 82 0.5609756
19 26 8 34 72 0.4722222
20 35 12 47 79 0.5949367
21 32 11 43 76 0.5657895
22 30 10 40 75 0.5333333
23 28 8 36 70 0.5142857
24 21 8 29 70 0.4142857
25 24 8 32 62 0.516129
26 34 15 49 83 0.5903614
27 24 9 33 63 0.5238095
28 26 14 40 66 0.6060606
29 35 6 41 84 0.4880952
30 28 8 36 74 0.4864865
31 26 14 40 79 0.5063291
32 31 15 46 88 0.5227273
33 35 14 49 93 0.5268817
34 35 19 54 85 0.6352941
35 27 8 35 64 0.546875
36 30 10 40 83 0.4819277
37 36 9 45 79 0.5696203
38 25 8 33 73 0.4520548
39 39 12 51 85 0.6
40 37 9 46 79 0.5822785
41 41 12 53 90 0.5888889
42 29 7 36 59 0.6101695
43 36 14 50 77 0.6493506
44 36 15 51 88 0.5795455
45 34 7 41 67 0.6119403
46 28 12 40 75 0.5333333
47 27 11 38 69 0.5507246
48 28 16 44 83 0.5301205
49 29 10 39 72 0.5416667
FOLLOW-UP OF THE FOREX DATA ANALYSIS EXPERIMENT: proving the significance of the predictions
  • 2016.03.02
  • Alexey Burnakov
  • www.mql5.com
The series starts at: https://www.mql5.com/ru/blogs/post/659572 https://www.mql5.com/ru/blogs/post/659929 https://www.mql5.com/ru/blogs/post/660386 https://www.mql5.com/ru/blogs/post/661062
 

All the raw data is available at the links in the blog.

And this model is hardly very profitable. The expected payoff (MO) is at the level of about half a point. But that is the direction I am pursuing.

 
SanSanych Fomenko:

Always learning from the past.

We have been looking at charts for centuries: here we see "three soldiers", there we see "head and shoulders". We have already seen so many such figures, we believe in them, and we trade on them...

And if the task is set as follows:

1. automatically find such figures - not on all charts, but for a particular currency pair, and those that occurred recently, not three centuries ago among the Japanese trading rice.

2. what the initial data is on which we automatically search for such figures - patterns.

To answer the first question, let us consider an algorithm called "random forest". It is fed 10-50-100-200 ... input variables. It then takes the full set of values of those variables at a single moment in time, corresponding to one bar, and looks for combinations of the input variables that, on the historical data, correspond to a quite definite result, for example a BUY order. And another set of combinations for the opposite order - SELL. A separate tree corresponds to each such set. Experience shows that for an input set of 18000 bars (about 3 years) the algorithm finds 200-300 trees. This is the set of patterns - near-analogues of "head and shoulders" and whole companies of soldiers.
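A rough sketch of that procedure in R (not SanSanych's actual code; the data frame dat with bar-wise features and a BUY/SELL factor column signal is hypothetical):

# Rough sketch only: grow a random forest on bar-wise features to classify
# BUY vs. SELL; 'dat' stands in for the ~18000-bar history described above.
library(randomForest)

set.seed(1)
rf <- randomForest(signal ~ ., data = dat, ntree = 300, importance = TRUE)
print(rf)   # confusion matrix and out-of-bag error estimate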

The problem with this algorithm is that such trees can pick up specifics that never occur again in the future. This is called "superfitting" here on the forum and "overfitting" in machine learning. It is known that the whole large set of input variables can be divided into two parts: those related to the output variable and those unrelated to it - the noise. So Burnakov tries to weed out the ones that are irrelevant to the output.
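Continuing the same hypothetical sketch, the forest's own variable importance can serve as a crude first filter for that weeding-out:

# Use the forest's variable importance to separate inputs that carry
# information about the output from the noise ones (illustration only).
imp  <- importance(rf, type = 1)       # mean decrease in accuracy
keep <- rownames(imp)[imp[, 1] > 0]    # crude cut-off for "informative" inputs
dat_reduced <- dat[, c(keep, "signal")]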

PS.

When building a trend TS (BUY, SELL), any kind of variable is related to noise!

What you see is a small part of the market and it's not the most important. Nobody builds the pyramid upside down.
 
yerlan Imangeldinov:
What you see is a small part of the market and not the most important. No one is building a pyramid upside down.
And specifically, what don't I see?
 
yerlan Imangeldinov:
What you see is a small part of the market and not the most important. No one is building a pyramid upside down.
You can add information to the system besides the price history. But you still have to train on the history. Or go by intuition.
 

I tried to train a neural network on the input data and then looked at the weights. If an input has a small weight, it is apparently not needed. I did it via R (Rattle), thanks to SanSanych for his article https://www.mql5.com/ru/articles/1165.

input      weight      subset
input_1    -186.905    yes
input_2    7.954625
input_3    -185.245    yes
input_4    14.88457
input_5    -206.037    yes
input_6    16.03497
input_7    190.0939    yes
input_8    23.05248
input_9    -182.923    yes
input_10   4.268967
input_11   196.8927    yes
input_12   16.43655
input_13   5.419367
input_14   8.76542
input_15   36.8237
input_16   5.940322
input_17   8.304859
input_18   8.176511
input_19   17.13691
input_20   -0.57317

I have not tested this approach in practice; I wonder whether it works or not. I would take input_1, input_3, input_5, input_7, input_9, input_11.
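A bare-bones version of the idea in plain R (a sketch, not the exact Rattle workflow; the data frame df with the 20 inputs and a 0/1 column output is hypothetical):

# Fit a tiny net and inspect the input-to-hidden weights by absolute value.
library(nnet)

set.seed(1)
nn <- nnet(output ~ ., data = df, size = 1, maxit = 500)
summary(nn)                         # prints weights labelled i1->h1, i2->h1, ...
in_w <- nn$wts[2:21]                # input-to-hidden weights (first element is the bias)
names(in_w) <- colnames(df)[colnames(df) != "output"]
sort(abs(in_w), decreasing = TRUE)  # inputs with tiny |weight| are candidates to drop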

Random forests predict trends (Случайные леса предсказывают тренды)
  • 2014.09.29
  • СанСаныч Фоменко
  • www.mql5.com
The article describes how to use the Rattle package to automatically search for patterns capable of predicting "longs" and "shorts" for Forex currency pairs. It will be useful both for novice and for experienced traders.
 
Dr.Trader:

I tried to train a neural network on the input data and then looked at the weights. If an input has a small weight, it is apparently not needed. I did it via R (Rattle), thanks to SanSanych for his article https://www.mql5.com/ru/articles/1165.

input      weight      subset
input_1    -186.905    yes
input_2    7.954625
input_3    -185.245    yes
input_4    14.88457
input_5    -206.037    yes
input_6    16.03497
input_7    190.0939    yes
input_8    23.05248
input_9    -182.923    yes
input_10   4.268967
input_11   196.8927    yes
input_12   16.43655
input_13   5.419367
input_14   8.76542
input_15   36.8237
input_16   5.940322
input_17   8.304859
input_18   8.176511
input_19   17.13691
input_20   -0.57317

I have not tested this approach in practice; I wonder whether it works or not. I would take input_1, input_3, input_5, input_7, input_9, input_11.

Hmm, that is very interesting.

A clarifying question: why don't you then also include some inputs where the weight is small, such as 13, 14, 16? Could you show a chart of the inputs and weights ordered by weight?

Sorry, I didn't understand at first. Yes, these inputs do have large absolute weights, as they should.

 

Visually, all the weights split into two groups. If they need to be divided by the significant/insignificant principle, then 5, 11, 7, 1, 3, 9 clearly stand out; I think that set is enough.
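For reference, the chart requested above (inputs and weights ordered by weight) can be produced in a few lines from the numbers quoted in the table - a quick illustration only:

# Order the quoted inputs by absolute weight and plot them, so the two groups
# become obvious (values taken from the weight table above).
w <- c(input_1 = -186.905, input_2 = 7.954625, input_3 = -185.245,
       input_4 = 14.88457, input_5 = -206.037, input_6 = 16.03497,
       input_7 = 190.0939, input_8 = 23.05248, input_9 = -182.923,
       input_10 = 4.268967, input_11 = 196.8927, input_12 = 16.43655,
       input_13 = 5.419367, input_14 = 8.76542, input_15 = 36.8237,
       input_16 = 5.940322, input_17 = 8.304859, input_18 = 8.176511,
       input_19 = 17.13691, input_20 = -0.57317)
barplot(sort(abs(w), decreasing = TRUE), las = 2,
        main = "Inputs ordered by absolute weight")
# inputs 5, 11, 7, 1, 3, 9 sit far above the rest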