Machine learning in trading: theory, models, practice and algo-trading - page 213
Another simulation example.
We build 20,000 linear models (1,000 observations each; the number of predictors runs from 1 to 20, with 1,000 models per predictor count, plus one dependent variable). The data are i.i.d. N(0,1).
The purpose of the simulation is to verify that the F-statistic does not exceed its critical value when an OLS regression is fitted to independent data (containing no dependencies) that satisfy the assumptions of the linear model. If so, the F-statistic can be used as an indicator of whether the model has learned anything.
############### simulate lm f-stats with random vars
rm(list=ls());gc()
library(data.table)
library(ggplot2)
start <- Sys.time()
set.seed(1)
x <- as.data.table(matrix(rnorm(21000000, 0, 1), ncol = 21))   # 21 i.i.d. N(0,1) columns; V21 will be the dependent variable
x[, sampling:= sample(1000, nrow(x), replace = T)]   # assign each row to one of 1,000 resamples
lm_models <- x[,
 {
  # for every resample: fit 20 nested models (V1..Vk -> V21, no intercept) and keep the F-statistic
  lapply(c(1:20), function(k) summary(lm(data = .SD[, c(1:k, 21), with = F], formula = V21 ~ . -1))$'fstatistic'[[1]])
 }
 , by = sampling
]
lm_models_melted <- melt(lm_models, measure.vars = paste0('V', c(1:20)))
critical_f_stats <- qf(p = 0.99, df1 = c(1:20), df2 = 1000, lower.tail = TRUE, log.p = FALSE)   # 0.99 critical values of the F distribution
boxplot(data = lm_models_melted, value ~ variable); lines(critical_f_stats, type = 's', col = 'red')
Sys.time() - start
gc()
Code running time: 1.35 minutes.
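Not from the original post, just a possible follow-up check: the share of simulated F-statistics that exceed the 0.99 critical value, per number of predictors (assuming lm_models_melted and critical_f_stats from the code above are still in memory). Under the null this share should hover around 1%.
# empirical rejection rate per predictor count; expected to stay close to alpha = 0.01
rejection_rate <- lm_models_melted[,
 .(share_above_critical = mean(value > critical_f_stats[as.integer(gsub('V', '', variable))])),
 by = variable
]
print(rejection_rate)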
Useful code. It visualizes trade sequences in three different views.
##########################
rm(list=ls());gc()
library(data.table)
library(ggplot2)
library(gridExtra)
library(tseries)
start <- Sys.time()
set.seed(1)
x <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) #random normal value with positive expectation
x[, variable:= rep(1:1000, times = 1000)]   # 1,000 sequences of 1,000 trades each
x[, trade:= 1:.N, by = variable]   # trade index within each sequence (.N = number of rows in the group)
x.cast = dcast.data.table(x, variable ~ trade, value.var = 'V1', fun.aggregate = sum)   # long -> wide: one row per sequence, one column per trade
x_cum <- x.cast[, as.list(cumsum(unlist(.SD))), by = variable]   # cumulative profit along each sequence (.SD = all non-grouping columns)
monte_trades <- melt(x_cum, measure.vars = names(x_cum)[-1], variable.name = "trade", value.name = 'V1')
setorder(monte_trades, variable, trade)
monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])   # final cumulative profit of each sequence
quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]   # 5% quantile path across sequences
RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]   # recovery factor: final profit / maximum drawdown
p1 <- ggplot(data = monte_trades, aes(x = trade, y = V1, group = variable)) +
geom_line(size = 2, color = 'blue', alpha = 0.01) +
geom_line(data = quantile_trade, aes(x = trade, y = V1, group = 1), size = 2, alpha = 0.5, colour = 'blue') +
ggtitle('Simulated Trade Sequences of Length 1000')
p2 <- ggplot(data = monte_trades_last, aes(V1)) +
geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') +
scale_x_continuous(limits = c(min(monte_trades$V1), max(monte_trades$V1))) +
coord_flip() +
ggtitle('Cumulative Profit Density')
p3 <- ggplot(data = RF_last, aes(V1)) +
geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') +
geom_vline(xintercept = mean(RF_last$V1), colour = "blue", linetype = 2, size = 1) +
geom_vline(xintercept = median(RF_last$V1), colour = "red", linetype = 2, size = 1) +
ggtitle('Recovery Factor Density + Mean (blue) and Median (red)')
grid.arrange(p1, p2, p3, ncol = 3)
Sys.time() - start
gc()
The calculation runs for about 45 seconds; drawing the plots takes about 1.5 minutes.
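For what it's worth, a quick numeric summary could be printed on top of the plots (a sketch only, reusing monte_trades_last and RF_last from the code above): the share of sequences that end in a loss, the 5% quantile of the final profit, and the mean/median recovery factor.
# simple numeric summary of the simulated trade sequences
summary_stats <- data.table(
 p_final_below_zero = mean(monte_trades_last$V1 < 0),        # share of sequences ending below zero
 final_q05 = quantile(monte_trades_last$V1, probs = 0.05),   # 5% quantile of final cumulative profit
 rf_mean = mean(RF_last$V1),                                 # mean recovery factor
 rf_median = median(RF_last$V1)                              # median recovery factor
)
print(summary_stats)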
How do I correctly interpret the result of that example?
Do I understand it right that the first predictor can be considered better than the first + second, and 1+2 better than 1+2+3? Is it reasonable to genetically select the set of predictors that gives the highest F-statistic?
I didn't fully understand the part about the F-statistic. The data here are random, yet the model has learned something, so we can conclude that the model is fitted and overtrained, which means its evaluation must be bad. That is, I was expecting a negative F-statistic, or some other indication on the graph that things are bad.
Look at the F-distribution table: http://www.socr.ucla.edu/applets.dir/f_table.html
The F-statistic is a value that depends on the degrees of freedom. It is always positive, since the distribution is one-sided.
But the model does not learn anything here, because a trained model should have a high F-statistic (greater than or equal to the critical value at a given alpha, as it is phrased when testing the null hypothesis).
In all cases it did not exceed the critical value at alpha = 0.01, and alpha could even be set to 0.0001, for example.
At the same time I wanted to make sure (I didn't study this at university) that adding noise variables would not make the linear model look like it is learning more. As you can see...
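For reference, the same critical values can also be obtained directly in R, and an observed F-statistic can be converted into a p-value; the numbers below (5 predictors, F = 2.5) are only an illustration:
# critical F values for df1 = 5 and df2 = 1000 at different alpha levels
qf(p = c(0.95, 0.99, 0.9999), df1 = 5, df2 = 1000)
# p-value of an observed F-statistic, e.g. F = 2.5 with the same degrees of freedom
pf(2.5, df1 = 5, df2 = 1000, lower.tail = FALSE)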
Useful code. It visualizes trade sequences in three different views.
Regarding the code above: please add at least brief comments to the code, especially where you use complex expressions. Not everyone knows and uses the data.table package, so it would not hurt to explain what dcast.data.table does and what .N and .SD mean. You don't post code to show how deeply you are into the subject; in my opinion, published code should help other users (even those at beginner level) understand the script.
It is wonderful that R lets you program the same action in several ways, but it is desirable not to lose code readability along the way.
A few suggestions on the code:
- the intermediate variables x, x.cast and x_cum are not needed in the calculations and only take up memory. Calculations that don't require saving intermediate results are better chained through a pipe, for instance:
rm(list=ls());gc()
library(data.table)
library(ggplot2)
library(gridExtra)
library(tseries)
#----
require(magrittr)
require(dplyr)
start <- Sys.time()
monte_trades <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%
.[, variable := rep(1:1000, times = 1000)]%>%
.[, trade := 1:.N, by = variable] %>%
dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum)%>%
.[, as.list(cumsum(unlist(.SD))), by = variable]%>%
melt(., measure.vars = names(.)[-1], variable.name = "trade", value.name = 'V1')%>%
setorder(., variable, trade)
monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])
quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]
RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]
Sys.time() - start
#Time difference of 2.247022 secs
Of course, it takes a very long time to build graphs.
Not a criticism.
Good luck
I didn't fully understand the part about the F-statistic. The data here are random, yet the model has learned something, so we can conclude that the model is fitted and overtrained, which means its evaluation must be bad. That is, I was expecting a negative F-statistic, or some other indication on the graph that things are bad.
How do I correctly interpret the result of that example?
Do I understand it right that the first predictor can be considered better than the first + second, and 1+2 better than 1+2+3? Is it reasonable to genetically select the set of predictors that gives the highest F-statistic?
And here is an example where we assume that a fully trained model includes all 20 variables, with increasing weights (variable 1 has weight 1, variable 20 has weight 20). Let's see how the distribution of the F-statistic changes as successive predictors are added to the model:
############### simulate lm f-stats with non-random vars
rm(list=ls());gc()
library(data.table)
library(ggplot2)
start <- Sys.time()
set.seed(1)
x <- as.data.table(matrix(rnorm(20000000, 0, 1), ncol = 20))   # 20 i.i.d. N(0,1) predictors
x[, (paste0('coef', c(1:20))):= lapply(1:20, function(i) rnorm(.N, i, 1))]   # noisy weights centred on 1..20
x[, output:= Reduce(`+`, Map(function(x, y) (x * y), .SD[, (1:20), with = FALSE], .SD[, (21:40), with = FALSE])), .SDcols = c(1:40)]   # response = sum of predictor * weight
x[, sampling:= sample(1000, nrow(x), replace = T)]   # assign each row to one of 1,000 resamples
lm_models <- x[,
 {
  # for every resample: fit 20 nested models (V1..Vk -> output, no intercept) and keep the F-statistic
  lapply(c(1:20), function(k) summary(lm(data = .SD[, c(1:k, 41), with = F], formula = output ~ . -1))$'fstatistic'[[1]])
 }
 , by = sampling
]
lm_models_melted <- melt(lm_models, measure.vars = paste0('V', c(1:20)))
critical_f_stats <- qf(p = 0.99, df1 = c(1:20), df2 = 1000, lower.tail = TRUE, log.p = FALSE)
boxplot(data = lm_models_melted, value ~ variable, log = 'y'); lines(critical_f_stats, type = 's', col = 'red')
summary(lm(data = x[sample(1000000, 1000, replace = T), c(1:20, 41), with = F], formula = output ~ . -1))
Sys.time() - start
gc()
Graph with logarithmic y-axis:
Apparently, yes...
> summary(lm(data = x[sample(1000000, 1000, replace = T), c(1:20, 41), with = F], formula = output ~ . -1))
Call:
lm(formula = output ~ . - 1, data = x[sample(1e+06, 1000, replace = T),
c(1:20, 41), with = F])
Residuals:
Min 1Q Median 3Q Max
-19.6146 -2.8252 0.0192 3.0659 15.8853
Coefficients:
Estimate Std. Error t value Pr(>|t|)
V1 0.9528 0.1427 6.676 4.1e-11 ***
V2 1.7771 0.1382 12.859 < 2e-16 ***
V3 2.7344 0.1442 18.968 < 2e-16 ***
V4 4.0195 0.1419 28.325 < 2e-16 ***
V5 5.2817 0.1479 35.718 < 2e-16 ***
V6 6.2776 0.1509 41.594 < 2e-16 ***
V7 6.9771 0.1446 48.242 < 2e-16 ***
V8 7.9722 0.1469 54.260 < 2e-16 ***
V9 9.0349 0.1462 61.806 < 2e-16 ***
V10 10.1372 0.1496 67.766 < 2e-16 ***
V11 10.8783 0.1487 73.134 < 2e-16 ***
V12 11.9129 0.1446 82.386 < 2e-16 ***
V13 12.8079 0.1462 87.588 < 2e-16 ***
V14 14.2017 0.1487 95.490 < 2e-16 ***
V15 14.9080 0.1458 102.252 < 2e-16 ***
V16 15.9893 0.1428 111.958 < 2e-16 ***
V17 17.4997 0.1403 124.716 < 2e-16 ***
V18 17.8798 0.1448 123.470 < 2e-16 ***
V19 18.9317 0.1470 128.823 < 2e-16 ***
V20 20.1143 0.1466 137.191 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.581 on 980 degrees of freedom
Multiple R-squared: 0.9932, Adjusted R-squared: 0.993
F-statistic: 7123 on 20 and 980 DF, p-value: < 2.2e-16
Thank you! I didn't know how to do that yet. Though really, you should keep the calculations in memory as much as possible; it will be faster. Good kung fu...
With your tweaks:
#---variant-------------
rm(list=ls());gc()
library(data.table)
library(ggplot2)
library(gridExtra)
library(tseries)
#----
require(magrittr)
require(dplyr)
start <- Sys.time()
monte_trades <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%
.[, variable := rep(1:1000, times = 1000)]%>%
.[, trade := 1:.N, by = variable] %>%
dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum)%>%
.[, as.list(cumsum(unlist(.SD))), by = variable]%>%
melt(., measure.vars = names(.)[-1], variable.name = "trade", value.name = 'V1')%>%
setorder(., variable, trade)
monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])
quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]
RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]
p1 <- ggplot(data = monte_trades, aes(x = trade, y = V1, group = variable)) +
geom_line(size = 2, color = 'blue', alpha = 0.01) +
geom_line(data = quantile_trade, aes(x = trade, y = V1, group = 1), size = 2, alpha = 0.5, colour = 'blue') +
ggtitle('Simulated Trade Sequences of Length 1000')
p2 <- ggplot(data = monte_trades_last, aes(V1)) +
geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') +
scale_x_continuous(limits = c(min(monte_trades$V1), max(monte_trades$V1))) +
coord_flip() +
ggtitle('Cumulative Profit Density')
p3 <- ggplot(data = RF_last, aes(V1)) +
geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') +
geom_vline(xintercept = mean(RF_last$V1), colour = "blue", linetype = 2, size = 1) +
geom_vline(xintercept = median(RF_last$V1), colour = "red", linetype = 2, size = 1) +
ggtitle('Recovery Factor Density + Mean (blue) and Median (red)')
grid.arrange(p1, p2, p3, ncol = 3)
Sys.time() - start
The running time is 47 seconds. That is, the code is prettier and more compact, but there is no difference in speed... Drawing, yes, takes very long, and it is because of the 1,000 lines with transparency...
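If the drawing time really bothers you, one workaround (just a sketch, not measured here) is to plot only a random subset of the 1,000 sequences; the overall picture hardly changes, but ggplot has far fewer transparent lines to render:
# draw only 100 of the 1,000 simulated sequences (assumes monte_trades from the code above)
subset_ids <- sample(unique(monte_trades$variable), 100)
p1_fast <- ggplot(data = monte_trades[variable %in% subset_ids], aes(x = trade, y = V1, group = variable)) +
 geom_line(size = 2, color = 'blue', alpha = 0.05) +
 ggtitle('Subset of Simulated Trade Sequences')
p1_fast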
My calculation takes
#time in seconds
# min lq mean median uq max neval
# 2.027561 2.253354 2.254134 2.275785 2.300051 2.610649 100
But this is not so important. It was about the readability of the code.
Good luck
PS. And do parallelize the lm() calculations. This is exactly the kind of task where it is needed.
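A minimal sketch of what that could look like with the base parallel package (my own illustration, assuming the x table and its sampling column from the first F-statistic simulation; on Windows you would need parLapply with a cluster instead of mclapply):
library(parallel)
# fit the 20 nested models for every resample in parallel; V21 is the dependent variable
f_stats <- mclapply(split(x, by = 'sampling'), function(dt) {
 sapply(1:20, function(k)
  summary(lm(V21 ~ . - 1, data = dt[, c(1:k, 21), with = FALSE]))$fstatistic[[1]])
}, mc.cores = detectCores() - 1)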
Nope. You are giving the time only for the part of the code before the graphs, while I reported mine together with the graphs.
My time up to the graphs is 1.5 seconds; with your method it is 1.15 seconds.
# variant with intermediate variables (timed up to the graphs)
rm(list=ls());gc()
library(data.table)
library(ggplot2)
library(gridExtra)
library(tseries)
start <- Sys.time()
set.seed(1)
x <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) #random normal value with positive expectation
x[, variable:= rep(1:1000, times = 1000)]
x[, trade:= 1:.N, by = variable]
x.cast = dcast.data.table(x, variable ~ trade, value.var = 'V1', fun.aggregate = sum)
x_cum <- x.cast[, as.list(cumsum(unlist(.SD))), by = variable]
monte_trades <- melt(x_cum, measure.vars = names(x_cum)[-1], variable.name = "trade", value.name = 'V1')
setorder(monte_trades, variable, trade)
monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])
quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]
RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]
Sys.time() - start
# piped variant (timed up to the graphs)
rm(list=ls());gc()
library(data.table)
library(ggplot2)
library(gridExtra)
library(tseries)
#----
require(magrittr)
require(dplyr)
start <- Sys.time()
monte_trades <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%
.[, variable := rep(1:1000, times = 1000)]%>%
.[, trade := 1:.N, by = variable] %>%
dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum)%>%
.[, as.list(cumsum(unlist(.SD))), by = variable]%>%
melt(., measure.vars = names(.)[-1], variable.name = "trade", value.name = 'V1')%>%
setorder(., variable, trade)
monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])
quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]
RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]
Sys.time() - start
It turns out that your method is faster...
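Single Sys.time() measurements are fairly noisy; for a cleaner comparison the microbenchmark package (the min/lq/median table quoted earlier looks like its output) could repeat both variants. A sketch, with the bodies trimmed to the reshaping part and times = 10 to keep it quick:
library(microbenchmark)
library(data.table)
library(magrittr)
variant_intermediate <- function() {
 x <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1))
 x[, variable := rep(1:1000, times = 1000)]
 x[, trade := 1:.N, by = variable]
 x.cast <- dcast.data.table(x, variable ~ trade, value.var = 'V1', fun.aggregate = sum)
 x_cum <- x.cast[, as.list(cumsum(unlist(.SD))), by = variable]
 melt(x_cum, measure.vars = names(x_cum)[-1], variable.name = 'trade', value.name = 'V1')
}
variant_piped <- function() {
 as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%
  .[, variable := rep(1:1000, times = 1000)] %>%
  .[, trade := 1:.N, by = variable] %>%
  dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum) %>%
  .[, as.list(cumsum(unlist(.SD))), by = variable] %>%
  melt(., measure.vars = names(.)[-1], variable.name = 'trade', value.name = 'V1')
}
microbenchmark(intermediate = variant_intermediate(), piped = variant_piped(), times = 10)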