Machine learning in trading: theory, models, practice and algo-trading - page 21

 
Dr.Trader:

I tried Y-scale too; R^2 came out the same in both cases (with and without Y-scale), even though different packages are used in these cases!

I understood that Y-scale can give the same good result with fewer principal components. But if the result is unsatisfactory even when all the components are used (as it is for me now), then there is no difference, and this way works faster, which matters more to me right now. But I have not proved, in theory or in practice, that this method is suitable for picking predictors... At first I had an idea to build a principal component model on all the predictors and pick predictors by looking at the component coefficients. But then I noticed that when garbage is added, the R^2 of the model drops. It would make sense to try different sets of predictors and look for those with a higher R^2, but for now that is just a theory.
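A minimal sketch of that idea, just to make it concrete: R^2 of a model built on the first few principal components drops once garbage predictors are mixed in. The simulated data and the lm-on-scores way of getting R^2 are my own assumptions, not the code used in the thread:

set.seed(1)
n <- 500
informative <- matrix(rnorm(n * 5), n, 5)  # 5 predictors that really drive the target
target <- as.vector(informative %*% runif(5, 0.5, 1.5) + rnorm(n, sd = 0.5))
garbage <- matrix(rnorm(n * 20), n, 20)    # pure noise predictors

pca_r2 <- function(x, y, n_comp = 5) {
  pc <- prcomp(x, center = TRUE, scale. = TRUE)
  scores <- as.data.frame(pc$x[, 1:n_comp, drop = FALSE])
  summary(lm(y ~ ., data = scores))$r.squared
}

pca_r2(informative, target)                  # high R^2
pca_r2(cbind(informative, garbage), target)  # R^2 drops once garbage is added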

I regularly make the following suggestion here: if you send me your set, we will compare my results with yours.

For me, the ideal is an .RData file: a frame in which the target is binary and the predictors are preferably real numbers.
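For what it is worth, a minimal sketch of packing such a frame into an .RData file; the frame and file names are illustrative, not something specified in the thread:

# hypothetical frame: real-valued predictors plus a binary target in the last column
my_set <- data.frame(pred1 = rnorm(100), pred2 = rnorm(100),
                     target = sample(c(0, 1), 100, replace = TRUE))
save(my_set, file = "my_set.RData")  # the recipient restores it with load("my_set.RData")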

 
Dr.Trader:

I used to train the forest and return the error on a validation sample. In principle it worked: if the forest overtrains even a little, the error immediately tends to 50%.

Now I use GetPCrsquared(), the code above. I also have your example from feature_selector_modeller.txt, but I still have to figure it out and extract the needed fragment of code from there, so I haven't tested it on my data yet.
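A small sketch of that forest check, with randomForest as one possible implementation (the post does not name the package) and simulated data standing in for the real set:

library(randomForest)

set.seed(1)
n <- 1000
x <- as.data.frame(matrix(rnorm(n * 10), n, 10))
train_idx <- 1:700

# informative target: the validation error stays well below 50%
y_good <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n) > 0, "buy", "sell"))
rf_good <- randomForest(x[train_idx, ], y_good[train_idx], ntree = 500)
mean(predict(rf_good, x[-train_idx, ]) != y_good[-train_idx])

# pure-noise target: the forest memorizes the training set,
# but the validation error drifts to about 50%
y_noise <- factor(sample(c("buy", "sell"), n, replace = TRUE))
rf_noise <- randomForest(x[train_idx, ], y_noise[train_idx], ntree = 500)
mean(predict(rf_noise, x[-train_idx, ]) != y_noise[-train_idx])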

What you need to take from there is this:

library(infotheo) # measured in nats, converted to bits
library(scales)
library(GenSA)

# get data
sampleA <- read.table('C:/Users/aburnakov/Documents/Private/dummy_set_features.csv'
                      , sep = ','
                      , header = T)

# calculate parameters
predictor_number <- dim(sampleA)[2] - 1
sample_size <- dim(sampleA)[1]
par_v <- runif(predictor_number, min = 0, max = 1)
par_low <- rep(0, times = predictor_number)
par_upp <- rep(1, times = predictor_number)

# load functions to memory

# noise baseline for the predictor redundancy: multiinformation of the predictor
# columns after each column is shuffled independently, returned as a quantile
shuffle_f_inp <- function(x = data.frame(), iterations_inp, quantile_val_inp){
  mutins <- c(1:iterations_inp)
  for (count in 1:iterations_inp){
    xx <- data.frame(1:dim(x)[1])
    for (count1 in 1:(dim(x)[2] - 1)){
      y <- as.data.frame(x[, count1])
      y$count <- sample(1:dim(x)[1], dim(x)[1], replace = F)
      y <- y[order(y$count), ]
      xx <- cbind(xx, y[, 1])
    }
    mutins[count] <- multiinformation(xx[, 2:dim(xx)[2]])
  }
  quantile(mutins, probs = quantile_val_inp)
}

# noise baseline for the predictors-to-target relation: mutual information with a
# shuffled target, normalized by the target entropy, returned as a quantile
shuffle_f <- function(x = data.frame(), iterations, quantile_val){
  height <- dim(x)[1]
  mutins <- c(1:iterations)
  for (count in 1:iterations){
    x$count <- sample(1:height, height, replace = F)
    y <- as.data.frame(c(x[dim(x)[2] - 1], x[dim(x)[2]]))
    y <- y[order(y$count), ]
    x[dim(x)[2]] <- NULL
    x[dim(x)[2]] <- NULL
    x$dep <- y[, 1]
    rm(y)
    receiver_entropy <- entropy(x[, dim(x)[2]])
    received_inf <- mutinformation(x[, 1:(dim(x)[2] - 1)], x[, dim(x)[2]])
    corr_ff <- received_inf / receiver_entropy
    mutins[count] <- corr_ff
  }
  quantile(mutins, probs = quantile_val)
}

############### the fitness function
fitness_f <- function(par){
  # a predictor is included when its GenSA parameter is at or above the threshold
  indexes <- c(1:predictor_number)
  for (i in 1:predictor_number){
    if (par[i] >= threshold) {
      indexes[i] <- i
    } else {
      indexes[i] <- 0
    }
  }
  local_predictor_number <- 0
  for (i in 1:predictor_number){
    if (indexes[i] > 0) {
      local_predictor_number <- local_predictor_number + 1
    }
  }
  if (local_predictor_number > 1) {
    sampleAf <- as.data.frame(sampleA[, c(indexes[], dim(sampleA)[2])])
    pred_entrs <- c(1:local_predictor_number)
    for (count in 1:local_predictor_number){
      pred_entrs[count] <- entropy(sampleAf[count])
    }
    # redundancy of the selected predictors as a share of its upper bound
    max_pred_ent <- sum(pred_entrs) - max(pred_entrs)
    pred_multiinf <- multiinformation(sampleAf[, 1:(dim(sampleAf)[2] - 1)])
    pred_multiinf <- pred_multiinf - shuffle_f_inp(sampleAf, iterations_inp, quantile_val_inp)
    if (pred_multiinf < 0){
      pred_multiinf <- 0
    }
    pred_mult_perc <- pred_multiinf / max_pred_ent
    inf_corr_val <- shuffle_f(sampleAf, iterations, quantile_val)
    receiver_entropy <- entropy(sampleAf[, dim(sampleAf)[2]])
    received_inf <- mutinformation(sampleAf[, 1:local_predictor_number], sampleAf[, dim(sampleAf)[2]])
    # GenSA minimizes: fitness = shuffled-noise level minus actual normalized mutual
    # information; a negative (good) value is shrunk in proportion to the redundancy
    if (inf_corr_val - (received_inf / receiver_entropy) < 0){
      fact_ff <- (inf_corr_val - (received_inf / receiver_entropy)) * (1 - pred_mult_perc)
    } else {
      fact_ff <- inf_corr_val - (received_inf / receiver_entropy)
    }
  } else if (local_predictor_number == 1) {
    sampleAf <- as.data.frame(sampleA[, c(indexes[], dim(sampleA)[2])])
    inf_corr_val <- shuffle_f(sampleAf, iterations, quantile_val)
    receiver_entropy <- entropy(sampleAf[, dim(sampleAf)[2]])
    received_inf <- mutinformation(sampleAf[, 1:local_predictor_number], sampleAf[, dim(sampleAf)[2]])
    fact_ff <- inf_corr_val - (received_inf / receiver_entropy)
  } else {
    fact_ff <- 0
  }
  return(fact_ff)
}

########## estimating threshold for variable inclusion
iterations = 5
quantile_val = 1

iterations_inp = 1
quantile_val_inp = 1

levels_arr <- numeric()
for (i in 1:predictor_number){
  levels_arr[i] <- length(unique(sampleA[, i]))
}

mean_levels <- mean(levels_arr)
# how many predictors of average cardinality keep roughly 100 observations per joint level
optim_var_num <- log(x = sample_size / 100, base = round(mean_levels, 0))

if (optim_var_num / predictor_number < 1){
  threshold <- 1 - optim_var_num / predictor_number
} else {
  threshold <- 0.5
}

# run feature selection
start <- Sys.time()

sao <- GenSA(par = par_v, fn = fitness_f, lower = par_low, upper = par_upp
             , control = list(
               #maxit = 10
               max.time = 1200
               , smooth = F
               , simple.function = F))

trace_ff <- data.frame(sao$trace)$function.value
plot(trace_ff, type = "l")
percent(- sao$value)
final_vector <- c((sao$par >= threshold), T)  # selected predictors plus the target column
names(sampleA)[final_vector]
final_sample <- as.data.frame(sampleA[, final_vector])

Sys.time() - start

In the data frame, the rightmost column is the target column.

ALL columns must be categorical (integer, character or factor).

And you have to load all the libraries.

A piece of code that shows how to convert numeric variables into categorical ones:

disc_levels <- 3 # how many equal-frequency levels of the variable are created

# discretize() comes from the infotheo package
for (i in 1:56){
  naming <- paste(names(dat[i]), 'var', sep = "_")
  dat[, eval(naming)] <- discretize(dat[, eval(names(dat[i]))], disc = "equalfreq", nbins = disc_levels)[, 1]
}

 

I found this interesting function on the Internet

# replaces the uniform time axis with the cumulative absolute price change
data_driven_time_warp <- function (y) {
  cbind(
    x = cumsum(c(0, abs(diff(y)))),
    y = y
  )
}


y <- cumsum(rnorm(200))+1000
i <- seq(1,length(y),by=10)
op <- par(mfrow=c(2,1), mar=c(.1,.1,.1,.1))
plot(y, type="l", axes = FALSE)
abline(v=i, col="grey")
lines(y, lwd=3)
box()
d <- data_driven_time_warp(y)
plot(d, type="l", axes=FALSE)
abline(v=d[i,1], col="grey")
lines(d, lwd=3)
box()
par(op)

Maybe in this form the algorithm will recognize the data better? But there is one problem: the function returns a matrix "d" with two columns, "x" and "y", where one is the price and the other is the time axis warped by the algorithm. The question is how to turn this matrix back into a vector without losing its properties.
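One possible answer, as a minimal sketch rather than something proposed in the thread: resample the warped curve back onto an evenly spaced grid with approx(), so the result is again a plain vector of the original length (the helper name warped_to_vector is mine):

# interpolate y over the warped x axis onto a uniform grid
warped_to_vector <- function(d, n = nrow(d)) {
  grid <- seq(min(d[, "x"]), max(d[, "x"]), length.out = n)
  approx(x = d[, "x"], y = d[, "y"], xout = grid)$y
}

y_warped <- warped_to_vector(d)
plot(y_warped, type = "l")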

 
SanSanych Fomenko:

I regularly make the following suggestion: if you send me your set, we will compare my results with yours.

For me, the ideal is an .RData file: a frame in which the target is binary and the predictors are preferably real numbers.

Attached is my best set of predictors. TrainData is D1 for EURUSD for 2015, fronttestData is from January 1, 2016 through June. The fronttest is a bit long; in real trading I am unlikely to trade for more than a month with the same settings, I just wanted to see how long the profitability of the model really lasts. FronttestData1, fronttestData2, fronttestData3 are separate cuts from fronttestData: only January, only February, only March. I'm really only interested in lowering the error on fronttestData1, the rest is just for research. The predictor set contains mostly indicators and various calculations between them. With nnet I get a 30% error on fronttestData1, training with iteration control and fitting the number of internal neurons. I think the 30% here is just a matter of chance; the model caught some trend in the market from March 2015 to February 2016. But the fact that the other periods are not losing money is already good.

Here is a picture from the MT5 tester, 2014.01-2016.06; I marked the training period with a frame. It's already better than it was.) For now this is my limit. I still have to solve a lot of problems with the indicators, namely that their default parameters are strictly tied to timeframes; on H1, for example, my experience is completely useless, and the same algorithm for selecting indicators considers everything on H1 garbage. I should either add a bunch of their variations with different parameters to the initial set of indicators, or generate random indicators from OHLC myself.
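A minimal sketch of what "training with iteration control and fitting the number of internal neurons" could look like with nnet; the frame names trainData and fronttestData1 come from the post, but the target layout, the grid of sizes and the iteration counts are my assumptions, not the actual code:

library(nnet)

# assumed layout: binary target in the last column, same structure in both frames
target_col <- ncol(trainData)
trainData[[target_col]] <- as.factor(trainData[[target_col]])
form <- as.formula(paste(names(trainData)[target_col], "~ ."))

best_err <- Inf
for (hidden in c(1, 2, 3, 5, 10)) {   # fitting the number of internal neurons
  for (iters in c(50, 100, 200)) {    # iteration control via maxit
    model <- nnet(form, data = trainData, size = hidden, maxit = iters,
                  decay = 1e-4, trace = FALSE)
    pred <- predict(model, fronttestData1, type = "class")
    err <- mean(pred != as.character(fronttestData1[[target_col]]))
    if (err < best_err) { best_err <- err; best_model <- model }
  }
}
best_err  # the post reports roughly 30% on fronttestData1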

 
Alexey Burnakov:

What you need to take from there is this:

That makes more sense, thank you. It seems to me that only 3 categories per indicator will not be enough. Logically, I would make at least 100 levels, but would that be better, or would it lose the whole point of the algorithm?

 
Dr.Trader:

That makes more sense, thank you. It seems to me that only 3 categories per indicator will not be enough. Logically, I would make at least 100 levels, but would that be better, or would it lose the whole point of the algorithm?

The point would be lost then. The algorithm counts the total number of levels of the input variables and how the response levels are distributed across them. Accordingly, if the number of response values at each input level is very low, it becomes impossible to estimate the statistical significance of the probability skew.

If you make 100 levels, and there are many variables, the algorithm will return zero significance for any subset, which is reasonable given the limited sample size.

A good example:

input level | number of observations
          1 | 150
          2 | 120
        ... | ...
          9 | 90

Here the significance of the skew within the response can be estimated.

A bad example:

input level | number of observations
        112 | 5
        ... | ...
        357 | 2
        ... | ...
       1045 | 1

Here it is not possible to estimate the significance of the skew within the response.
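A hedged illustration of why the per-level counts matter; the simulated data and the chi-squared test are only my way of showing the point, not the machinery used by the algorithm above:

set.seed(1)
n <- 500
response <- sample(c(0, 1), n, replace = TRUE)

# good case: about 9 input levels, each with dozens of observations
few_levels <- sample(1:9, n, replace = TRUE)
table(few_levels, response)                      # every cell is well populated
chisq.test(table(few_levels, response))$p.value  # the skew (or its absence) is measurable

# bad case: about 300 input levels with 1-2 observations each
many_levels <- sample(1:300, n, replace = TRUE)
chisq.test(table(many_levels, response))$p.value
# chisq.test() warns that the approximation may be incorrect: the counts are too thin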

 
Dr.Trader:

Attached is my best set of predictors. TrainData is D1 for EURUSD for 2015, fronttestData is from January 1, 2016 through June. The fronttest is a bit long; in real trading I am unlikely to trade for more than a month with the same settings, I just wanted to see how long the profitability of the model really lasts. FronttestData1, fronttestData2, fronttestData3 are separate cuts from fronttestData: only January, only February, only March. I'm really only interested in lowering the error on fronttestData1, the rest is just for research. The predictor set contains mostly indicators and various calculations between them. With nnet I get a 30% error on fronttestData1, training with iteration control and fitting the number of internal neurons. I think the 30% here is just a matter of chance; the model caught some trend in the market from March 2015 to February 2016. But the fact that the other periods are not losing money is already good.

Here is a picture from the MT5 tester, 2014.01-2016.06; I marked the training period with a frame. It's already better than it was.) For now this is my limit. I still have to solve a lot of problems with the indicators, namely that their default parameters are strictly tied to timeframes; on H1, for example, my experience is completely useless, and the same algorithm for selecting indicators considers everything on H1 garbage. I should either add a bunch of their variations with different parameters to the initial set of indicators, or generate random indicators from OHLC myself.

Not bad, but the out-of-sample periods themselves are small.

It is also not clear how many trades fall outside the sample: are there dozens, hundreds, what is the order of magnitude?

 
Dr.Trader:

Attached is my best set of predictors. TrainData is D1 for EURUSD for 2015, fronttestData is from January 1, 2016 through June. The fronttest is a bit long; in real trading I am unlikely to trade for more than a month with the same settings, I just wanted to see how long the profitability of the model really lasts. FronttestData1, fronttestData2, fronttestData3 are separate cuts from fronttestData: only January, only February, only March. I'm really only interested in lowering the error on fronttestData1, the rest is just for research. The predictor set contains mostly indicators and various calculations between them. With nnet I get a 30% error on fronttestData1, training with iteration control and fitting the number of internal neurons. I think the 30% here is just a matter of chance; the model caught some trend in the market from March 2015 to February 2016. But the fact that the other periods are not losing money is already good.

Here is a picture from the MT5 tester, 2014.01-2016.06; I marked the training period with a frame. It's already better than it was.) For now this is my limit. I still have to solve a lot of problems with the indicators, namely that their default parameters are strictly tied to timeframes; on H1, for example, my experience is completely useless, and the same algorithm for selecting indicators considers everything on H1 garbage. I should either add a bunch of their variations with different parameters to the initial set of indicators, or generate random indicators from OHLC myself.

I took a look at it.

Did I understand correctly that the total dataset has 107 rows (107 observations)?

 
SanSanych Fomenko:

I took a look at it.

Did I understand correctly that the total dataset has 107 rows (107 observations)?

No, the training set has 250-something rows (the number of trading days in 2015). I trained the model on the trainData table and tested it on fronttestData1. Everything else is for additional checks; you can ignore it.

trainData - all year 2015.
fronttestData1 - January 2016
fronttestData2 - February 2016
fronttestData3 - March 2016
fronttestData - January 2016 - June 2016

 
Dr.Trader:

No, the training set has 250-something rows (the number of trading days in 2015). I trained the model on the trainData table. I tested it on fronttestData1. Everything else is for additional checks, you can ignore them.

TrainData - the whole year 2015.
fronttestData1 - January 2016
fronttestData2 - February 2016
fronttestData3 - March 2016
fronttestData - January 2016 - June 2016

For me that is very little; I use statistics. Even for the current window, 107 rows is very little for me; I use over 400 for the current window.

Generally, in your sets the number of observations is comparable to the number of predictors. These are very specific sets; I have seen somewhere that such sets require special methods, but I have no references, as I do not have such problems myself.

Unfortunately, my methods are not suitable for your data.
