Machine learning in trading: theory, models, practice and algo-trading - page 213
Another simulation example.
We build 20,000 linear models (1,000 observations each; the number of predictors runs from 1 to 20, with 1,000 models per predictor count, plus one dependent variable). The data are i.i.d. N(0,1).
The purpose of the simulation is to verify that the F-statistic does not exceed its critical value when an OLS regression is fitted to independent data (containing no dependencies) that satisfy the assumptions of the linear model. If so, the F-statistic can be used as an indicator of whether the model has learned anything.
############### simulate lm f-stats with random vars
rm(list=ls());gc()
library(data.table)
library(ggplot2)
start <- Sys.time()
set.seed(1)
x <- as.data.table(matrix(rnorm(21000000, 0, 1), ncol = 21))   # 21 i.i.d. N(0,1) columns; V21 will be the dependent variable
x[, sampling:= sample(1000, nrow(x), replace = T)]   # assign each row to one of 1,000 resamples
lm_models <- x[,
 {
  # for every resample: fit 20 nested models (V1..Vk -> V21, no intercept) and keep the F-statistic
  lapply(c(1:20), function(k) summary(lm(data = .SD[, c(1:k, 21), with = F], formula = V21 ~ . -1))$'fstatistic'[[1]])
 }
 , by = sampling
]
lm_models_melted <- melt(lm_models, measure.vars = paste0('V', c(1:20)))
critical_f_stats <- qf(p = 0.99, df1 = c(1:20), df2 = 1000, lower.tail = TRUE, log.p = FALSE)   # 0.99 critical values of the F distribution
boxplot(data = lm_models_melted, value ~ variable); lines(critical_f_stats, type = 's', col = 'red')
Sys.time() - start
gc()
Code running time: 1.35 minutes.
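Not from the original post, just a possible follow-up check: the share of simulated F-statistics that exceed the 0.99 critical value, per number of predictors (assuming lm_models_melted and critical_f_stats from the code above are still in memory). Under the null this share should hover around 1%.
# empirical rejection rate per predictor count; expected to stay close to alpha = 0.01
rejection_rate <- lm_models_melted[,
 .(share_above_critical = mean(value > critical_f_stats[as.integer(gsub('V', '', variable))])),
 by = variable
]
print(rejection_rate)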
Useful code. It visualizes trade sequences in three different views.
##########################
rm(list=ls());gc()
library(data.table)
library(ggplot2)
library(gridExtra)
library(tseries)
start <- Sys.time()
set.seed(1)
x <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) #random normal value with positive expectation
x[, variable:= rep(1:1000, times = 1000)]   # 1,000 sequences of 1,000 trades each
x[, trade:= 1:.N, by = variable]   # trade index within each sequence (.N = number of rows in the group)
x.cast = dcast.data.table(x, variable ~ trade, value.var = 'V1', fun.aggregate = sum)   # long -> wide: one row per sequence, one column per trade
x_cum <- x.cast[, as.list(cumsum(unlist(.SD))), by = variable]   # cumulative profit along each sequence (.SD = all non-grouping columns)
monte_trades <- melt(x_cum, measure.vars = names(x_cum)[-1], variable.name = "trade", value.name = 'V1')
setorder(monte_trades, variable, trade)
monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])   # final cumulative profit of each sequence
quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]   # 5% quantile path across sequences
RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]   # recovery factor: final profit / maximum drawdown
p1 <- ggplot(data = monte_trades, aes(x = trade, y = V1, group = variable)) +
geom_line(size = 2, color = 'blue', alpha = 0.01) +
geom_line(data = quantile_trade, aes(x = trade, y = V1, group = 1), size = 2, alpha = 0.5, colour = 'blue') +
ggtitle('Simulated Trade Sequences of Length 1000')
p2 <- ggplot(data = monte_trades_last, aes(V1)) +
geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') +
scale_x_continuous(limits = c(min(monte_trades$V1), max(monte_trades$V1))) +
coord_flip() +
ggtitle('Cumulative Profit Density')
p3 <- ggplot(data = RF_last, aes(V1)) +
geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') +
geom_vline(xintercept = mean(RF_last$V1), colour = "blue", linetype = 2, size = 1) +
geom_vline(xintercept = median(RF_last$V1), colour = "red", linetype = 2, size = 1) +
ggtitle('Recovery Factor Density + Mean (blue) and Median (red)')
grid.arrange(p1, p2, p3, ncol = 3)
Sys.time() - start
gc()
The calculation runs for about 45 seconds; drawing the plots takes about 1.5 minutes.
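For what it's worth, a quick numeric summary could be printed on top of the plots (a sketch only, reusing monte_trades_last and RF_last from the code above): the share of sequences that end in a loss, the 5% quantile of the final profit, and the mean/median recovery factor.
# simple numeric summary of the simulated trade sequences
summary_stats <- data.table(
 p_final_below_zero = mean(monte_trades_last$V1 < 0),        # share of sequences ending below zero
 final_q05 = quantile(monte_trades_last$V1, probs = 0.05),   # 5% quantile of final cumulative profit
 rf_mean = mean(RF_last$V1),                                 # mean recovery factor
 rf_median = median(RF_last$V1)                              # median recovery factor
)
print(summary_stats)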
How do I correctly interpret the result of that example?
Do I understand it right that the first predictor can be considered better than the first + second, and 1+2 better than 1+2+3? Is it reasonable to genetically select the set of predictors that gives the highest F-statistic?
I didn't fully understand the part about the F-statistic. The data here are random, yet the model has learned something, so we can conclude that the model is fitted and overtrained, which means its evaluation must be bad. That is, I was expecting a negative F-statistic, or some other indication on the graph that things are bad.
Look at the F-distribution table: http://www.socr.ucla.edu/applets.dir/f_table.html
The F-statistic is a value that depends on the degrees of freedom. It is always positive, since the distribution is one-sided.
But the model does not learn anything here, because a trained model should have a high F-statistic (greater than or equal to the critical value at a given alpha, as it is phrased when testing the null hypothesis).
In all cases it did not exceed the critical value at alpha = 0.01, and alpha could even be set to 0.0001, for example.
At the same time I wanted to make sure (I didn't study this at university) that adding noise variables would not make the linear model look like it is learning more. As you can see...
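For reference, the same critical values can also be obtained directly in R, and an observed F-statistic can be converted into a p-value; the numbers below (5 predictors, F = 2.5) are only an illustration:
# critical F values for df1 = 5 and df2 = 1000 at different alpha levels
qf(p = c(0.95, 0.99, 0.9999), df1 = 5, df2 = 1000)
# p-value of an observed F-statistic, e.g. F = 2.5 with the same degrees of freedom
pf(2.5, df1 = 5, df2 = 1000, lower.tail = FALSE)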
Useful code. It visualizes trade sequences in three different views.
Regarding the code above: please add at least brief comments to the code, especially where you use complex expressions. Not everyone knows and uses the data.table package, so it would not hurt to explain what dcast.data.table does and what .N and .SD mean. You don't post code to show how deeply you are into the subject; in my opinion, published code should help other users (even those at beginner level) understand the script.
It is wonderful that R lets you program the same action in several ways, but it is desirable not to lose code readability along the way.
A few suggestions on the code:
- the intermediate variables x, x.cast and x_cum are not needed in the calculations and only take up memory. Calculations that don't require saving intermediate results are better chained through a pipe, for instance:
rm(list=ls());gc()
library(data.table)
library(ggplot2)
library(gridExtra)
library(tseries)
#----
require(magrittr)
require(dplyr)
start <- Sys.time()
monte_trades <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%
.[, variable := rep(1:1000, times = 1000)]%>%
.[, trade := 1:.N, by = variable] %>%
dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum)%>%
.[, as.list(cumsum(unlist(.SD))), by = variable]%>%
melt(., measure.vars = names(.)[-1], variable.name = "trade", value.name = 'V1')%>%
setorder(., variable, trade)
monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])
quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]
RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]
Sys.time() - start
#Time difference of 2.247022 secs
Of course, it takes a very long time to build graphs.
Not a criticism.
Good luck
I didn't fully understand the part about the F-statistic. The data here are random, yet the model has learned something, so we can conclude that the model is fitted and overtrained, which means its evaluation must be bad. That is, I was expecting a negative F-statistic, or some other indication on the graph that things are bad.
How do I correctly interpret the result of that example?
Do I understand it right that the first predictor can be considered better than the first + second, and 1+2 better than 1+2+3? Is it reasonable to genetically select the set of predictors that gives the highest F-statistic?
And here is an example where we assume that a fully trained model includes all 20 variables, with increasing weights (variable 1 has weight 1, variable 20 has weight 20). Let's see how the distribution of the F-statistic changes as successive predictors are added to the model:
############### simulate lm f-stats with non-random vars
rm(list=ls());gc()
library(data.table)
library(ggplot2)
start <- Sys.time()
set.seed(1)
x <- as.data.table(matrix(rnorm(20000000, 0, 1), ncol = 20))   # 20 i.i.d. N(0,1) predictors
x[, (paste0('coef', c(1:20))):= lapply(1:20, function(i) rnorm(.N, i, 1))]   # noisy weights centred on 1..20
x[, output:= Reduce(`+`, Map(function(x, y) (x * y), .SD[, (1:20), with = FALSE], .SD[, (21:40), with = FALSE])), .SDcols = c(1:40)]   # response = sum of predictor * weight
x[, sampling:= sample(1000, nrow(x), replace = T)]   # assign each row to one of 1,000 resamples
lm_models <- x[,
 {
  # for every resample: fit 20 nested models (V1..Vk -> output, no intercept) and keep the F-statistic
  lapply(c(1:20), function(k) summary(lm(data = .SD[, c(1:k, 41), with = F], formula = output ~ . -1))$'fstatistic'[[1]])
 }
 , by = sampling
]
lm_models_melted <- melt(lm_models, measure.vars = paste0('V', c(1:20)))
critical_f_stats <- qf(p = 0.99, df1 = c(1:20), df2 = 1000, lower.tail = TRUE, log.p = FALSE)
boxplot(data = lm_models_melted, value ~ variable, log = 'y'); lines(critical_f_stats, type = 's', col = 'red')
summary(lm(data = x[sample(1000000, 1000, replace = T), c(1:20, 41), with = F], formula = output ~ . -1))
Sys.time() - start
gc()
Graph with logarithmic y-axis:
Apparently, yes...
> summary(lm(data = x[sample(1000000, 1000, replace = T), c(1:20, 41), with = F], formula = output ~ . -1))
Call:
lm(formula = output ~ . - 1, data = x[sample(1e+06, 1000, replace = T),
c(1:20, 41), with = F])
Residuals:
Min 1Q Median 3Q Max
-19.6146 -2.8252 0.0192 3.0659 15.8853
Coefficients:
Estimate Std. Error t value Pr(>|t|)
V1 0.9528 0.1427 6.676 4.1e-11 ***
V2 1.7771 0.1382 12.859 < 2e-16 ***
V3 2.7344 0.1442 18.968 < 2e-16 ***
V4 4.0195 0.1419 28.325 < 2e-16 ***
V5 5.2817 0.1479 35.718 < 2e-16 ***
V6 6.2776 0.1509 41.594 < 2e-16 ***
V7 6.9771 0.1446 48.242 < 2e-16 ***
V8 7.9722 0.1469 54.260 < 2e-16 ***
V9 9.0349 0.1462 61.806 < 2e-16 ***
V10 10.1372 0.1496 67.766 < 2e-16 ***
V11 10.8783 0.1487 73.134 < 2e-16 ***
V12 11.9129 0.1446 82.386 < 2e-16 ***
V13 12.8079 0.1462 87.588 < 2e-16 ***
V14 14.2017 0.1487 95.490 < 2e-16 ***
V15 14.9080 0.1458 102.252 < 2e-16 ***
V16 15.9893 0.1428 111.958 < 2e-16 ***
V17 17.4997 0.1403 124.716 < 2e-16 ***
V18 17.8798 0.1448 123.470 < 2e-16 ***
V19 18.9317 0.1470 128.823 < 2e-16 ***
V20 20.1143 0.1466 137.191 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.581 on 980 degrees of freedom
Multiple R-squared: 0.9932, Adjusted R-squared: 0.993
F-statistic: 7123 on 20 and 980 DF, p-value: < 2.2e-16
Thank you! I didn't know how to do that yet. Though really, you should keep the calculations in memory as much as possible; it will be faster. Good kung fu...
With your tweaks:
#---variant-------------
rm(list=ls());gc()
library(data.table)
library(ggplot2)
library(gridExtra)
library(tseries)
#----
require(magrittr)
require(dplyr)
start <- Sys.time()
monte_trades <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%
.[, variable := rep(1:1000, times = 1000)]%>%
.[, trade := 1:.N, by = variable] %>%
dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum)%>%
.[, as.list(cumsum(unlist(.SD))), by = variable]%>%
melt(., measure.vars = names(.)[-1], variable.name = "trade", value.name = 'V1')%>%
setorder(., variable, trade)
monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])
quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]
RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]
p1 <- ggplot(data = monte_trades, aes(x = trade, y = V1, group = variable)) +
geom_line(size = 2, color = 'blue', alpha = 0.01) +
geom_line(data = quantile_trade, aes(x = trade, y = V1, group = 1), size = 2, alpha = 0.5, colour = 'blue') +
ggtitle('Simulated Trade Sequences of Length 1000')
p2 <- ggplot(data = monte_trades_last, aes(V1)) +
geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') +
scale_x_continuous(limits = c(min(monte_trades$V1), max(monte_trades$V1))) +
coord_flip() +
ggtitle('Cumulative Profit Density')
p3 <- ggplot(data = RF_last, aes(V1)) +
geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') +
geom_vline(xintercept = mean(RF_last$V1), colour = "blue", linetype = 2, size = 1) +
geom_vline(xintercept = median(RF_last$V1), colour = "red", linetype = 2, size = 1) +
ggtitle('Recovery Factor Density + Mean (blue) and Median (red)')
grid.arrange(p1, p2, p3, ncol = 3)
Sys.time() - start
The running time is 47 seconds. That is, the code is prettier and more compact, but there is no difference in speed... Drawing, yes, takes very long, and it is because of the 1,000 lines with transparency...
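If the drawing time really bothers you, one workaround (just a sketch, not measured here) is to plot only a random subset of the 1,000 sequences; the overall picture hardly changes, but ggplot has far fewer transparent lines to render:
# draw only 100 of the 1,000 simulated sequences (assumes monte_trades from the code above)
subset_ids <- sample(unique(monte_trades$variable), 100)
p1_fast <- ggplot(data = monte_trades[variable %in% subset_ids], aes(x = trade, y = V1, group = variable)) +
 geom_line(size = 2, color = 'blue', alpha = 0.05) +
 ggtitle('Subset of Simulated Trade Sequences')
p1_fast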
My calculation takes
#time in seconds
# min lq mean median uq max neval
# 2.027561 2.253354 2.254134 2.275785 2.300051 2.610649 100
But this is not so important. It was about the readability of the code.
Good luck
PS. And do parallelize the lm() calculations. This is exactly the kind of task where it is needed.
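A minimal sketch of what that could look like with the base parallel package (my own illustration, assuming the x table and its sampling column from the first F-statistic simulation; on Windows you would need parLapply with a cluster instead of mclapply):
library(parallel)
# fit the 20 nested models for every resample in parallel; V21 is the dependent variable
f_stats <- mclapply(split(x, by = 'sampling'), function(dt) {
 sapply(1:20, function(k)
  summary(lm(V21 ~ . - 1, data = dt[, c(1:k, 21), with = FALSE]))$fstatistic[[1]])
}, mc.cores = detectCores() - 1)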
Nope. You are giving the time only for the part of the code before the graphs, while I reported mine together with the graphs.
My time up to the graphs is 1.5 seconds; with your method it is 1.15 seconds.
# variant with intermediate variables (timed up to the graphs)
rm(list=ls());gc()
library(data.table)
library(ggplot2)
library(gridExtra)
library(tseries)
start <- Sys.time()
set.seed(1)
x <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) #random normal value with positive expectation
x[, variable:= rep(1:1000, times = 1000)]
x[, trade:= 1:.N, by = variable]
x.cast = dcast.data.table(x, variable ~ trade, value.var = 'V1', fun.aggregate = sum)
x_cum <- x.cast[, as.list(cumsum(unlist(.SD))), by = variable]
monte_trades <- melt(x_cum, measure.vars = names(x_cum)[-1], variable.name = "trade", value.name = 'V1')
setorder(monte_trades, variable, trade)
monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])
quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]
RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]
Sys.time() - start
# piped variant (timed up to the graphs)
rm(list=ls());gc()
library(data.table)
library(ggplot2)
library(gridExtra)
library(tseries)
#----
require(magrittr)
require(dplyr)
start <- Sys.time()
monte_trades <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%
.[, variable := rep(1:1000, times = 1000)]%>%
.[, trade := 1:.N, by = variable] %>%
dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum)%>%
.[, as.list(cumsum(unlist(.SD))), by = variable]%>%
melt(., measure.vars = names(.)[-1], variable.name = "trade", value.name = 'V1')%>%
setorder(., variable, trade)
monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])
quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]
RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]
Sys.time() - start
It turns out that your method is faster...
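Single Sys.time() measurements are fairly noisy; for a cleaner comparison the microbenchmark package (the min/lq/median table quoted earlier looks like its output) could repeat both variants. A sketch, with the bodies trimmed to the reshaping part and times = 10 to keep it quick:
library(microbenchmark)
library(data.table)
library(magrittr)
variant_intermediate <- function() {
 x <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1))
 x[, variable := rep(1:1000, times = 1000)]
 x[, trade := 1:.N, by = variable]
 x.cast <- dcast.data.table(x, variable ~ trade, value.var = 'V1', fun.aggregate = sum)
 x_cum <- x.cast[, as.list(cumsum(unlist(.SD))), by = variable]
 melt(x_cum, measure.vars = names(x_cum)[-1], variable.name = 'trade', value.name = 'V1')
}
variant_piped <- function() {
 as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%
  .[, variable := rep(1:1000, times = 1000)] %>%
  .[, trade := 1:.N, by = variable] %>%
  dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum) %>%
  .[, as.list(cumsum(unlist(.SD))), by = variable] %>%
  melt(., measure.vars = names(.)[-1], variable.name = 'trade', value.name = 'V1')
}
microbenchmark(intermediate = variant_intermediate(), piped = variant_piped(), times = 10)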