Machine learning in trading: theory, models, practice and algo-trading

Alexey Burnakov 2016.11.15 14:56 #2121

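Another simulation example.

We build 20,000 linear models (about 1,000 observations each; the number of predictors runs from 1 to 20, with 1,000 models for each number; plus one response variable). The data are i.i.d. N(0,1).

The purpose of the simulation is to confirm that when an OLS regression is fitted on independent data (containing no dependencies), the F-statistic does not exceed the critical value, provided the requirements of the linear model are met. It can therefore be used as an indicator of whether the model has been trained.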

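############### simulate lm f-stats with random vars
rm(list=ls());gc()
library(data.table)
library(ggplot2)
start <- Sys.time()
set.seed(1)
# 1,000,000 rows x 21 columns of i.i.d. N(0,1) noise: V1..V20 are predictors, V21 is the response
x <- as.data.table(matrix(rnorm(21000000, 0, 1), ncol = 21))
# randomly assign every row to one of 1,000 groups (about 1,000 observations per group)
x[, sampling:= sample(1000, nrow(x), replace = TRUE)]
# within each group fit 20 no-intercept models V21 ~ V1..Vk, k = 1..20, and keep the F-statistic
lm_models <- x[,
{
lapply(c(1:20), function(k) summary(lm(data = .SD[, c(1:k, 21), with = FALSE], formula = V21 ~ . -1))$'fstatistic'[[1]])
}
, by = sampling
]
# long format: one row per (group, number of predictors)
lm_models_melted <- melt(lm_models, measure.vars = paste0('V', c(1:20)))
# critical values at alpha = 0.01 (99% quantiles of F(k, 1000))
critical_f_stats <- qf(p = 0.99, df1 = c(1:20), df2 = 1000, lower.tail = TRUE, log.p = FALSE)
boxplot(data = lm_models_melted, value ~ variable); lines(critical_f_stats, type = 's', col = 'red')
Sys.time() - start
gc()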

Code run time: 1.35 minutes.


Alexey Burnakov 2016.11.15 16:21 #2122

Useful code. It visualizes trade sequences in three views.

##########################

rm(list=ls());gc()

library(data.table)

library(ggplot2)

library(gridExtra)

library(tseries)

start <- Sys.time()

set.seed(1)

x <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) #random normal value with positive expectation

x[, variable:= rep(1:1000, times = 1000)]

x[, trade:= 1:.N, by = variable]

x.cast = dcast.data.table(x, variable ~ trade, value.var = 'V1', fun.aggregate = sum)

x_cum <- x.cast[, as.list(cumsum(unlist(.SD))), by = variable]

monte_trades <- melt(x_cum, measure.vars = names(x_cum)[-1], variable.name = "trade", value.name = 'V1')

setorder(monte_trades, variable, trade)

monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])

quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]

RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]

p1 <- ggplot(data = monte_trades, aes(x = trade, y = V1, group = variable)) +

geom_line(size = 2, color = 'blue', alpha = 0.01) +

geom_line(data = quantile_trade, aes(x = trade, y = V1, group = 1), size = 2, alpha = 0.5, colour = 'blue') +

ggtitle('Simulated Trade Sequences of Length 1000')

p2 <- ggplot(data = monte_trades_last, aes(V1)) +

geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') +

scale_x_continuous(limits = c(min(monte_trades$V1), max(monte_trades$V1))) +

coord_flip() +

ggtitle('Cumulative Profit Density')

p3 <- ggplot(data = RF_last, aes(V1)) +

geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') +

geom_vline(xintercept = mean(RF_last$V1), colour = "blue", linetype = 2, size = 1) +

geom_vline(xintercept = median(RF_last$V1), colour = "red", linetype = 2, size = 1) +

ggtitle('Recovery Factor Density + Mean (blue) and Median (red)')

grid.arrange(p1, p2, p3, ncol = 3)

Sys.time() - start

gc()

It runs in about 45 seconds. Plotting takes about 1.5 minutes.
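For readers unfamiliar with the recovery factor shown in the third panel, here is a minimal base-R sketch of the same quantity on a single simulated sequence. It assumes the drawdown is measured as the largest drop of the cumulative profit from its running maximum, which is how I understand tseries::maxdrawdown to work. The helper names max_drawdown and recovery_factor are made up for this illustration and are not part of the script.

```r
# Hypothetical helpers, not part of the script above.
# Maximum drawdown: largest drop of the equity curve from its running peak.
max_drawdown <- function(equity) {
  max(cummax(equity) - equity)
}

# Recovery factor: final cumulative profit divided by the maximum drawdown.
recovery_factor <- function(equity) {
  equity[length(equity)] / max_drawdown(equity)
}

set.seed(1)
trades <- rnorm(1000, mean = 0.1, sd = 1)  # same per-trade distribution as in the script
equity <- cumsum(trades)                   # cumulative profit of one trade sequence

cat("final profit:   ", equity[length(equity)], "\n")
cat("max drawdown:   ", max_drawdown(equity), "\n")
cat("recovery factor:", recovery_factor(equity), "\n")

# Tiny hand-checkable case: peak 3, trough 1 gives drawdown 2; final profit 4 gives RF 2
toy <- c(1, 3, 1, 4)
stopifnot(max_drawdown(toy) == 2, recovery_factor(toy) == 2)
```

Applying recovery_factor to each of the 1,000 sequences would give the values whose density is plotted in the third panel, under the assumption stated above.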


Dr. Trader 2016.11.15 19:54 #2123

Alexey Burnakov:

Very nice, thank you.

Dr. Trader 2016.11.16 04:31 #2124

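Alexey Burnakov:

The purpose of the simulation is to confirm that when an OLS regression is fitted on independent data (containing no dependencies), the F-statistic does not exceed the critical value, provided the requirements of the linear model are met. It can therefore be used as an indicator of whether the model has been trained.

I don't fully understand what fstatistic means. The data here are random, yet the model was trained on something, so one can conclude that the model is fitted and overtrained. That means the model's evaluation must be bad. In other words, I was expecting a negative fstatistic, or some other sign on the plot that things are bad.
How do I interpret the result of this example correctly?
My understanding is that the first predictor can be considered better quality than the first + second, and 1+2 is better than 1+2+3. Is that right? Does it make sense to use a genetic algorithm to select the set of predictors that gives the highest fstatistic?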

Alexey Burnakov 2016.11.16 08:59 #2125

Dr.Trader:
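I don't fully understand what fstatistic means. The data here are random, yet the model has learned something, so one can conclude that the model is fitted and overtrained. That means the model's evaluation must be bad. In other words, I was expecting a negative fstatistic, or some other sign on the plot that things are bad.
How do I interpret the result of this example correctly?
My understanding is that the first predictor can be considered better quality than the first + second, and 1+2 is better than 1+2+3. Is that right? Does it make sense to use a genetic algorithm to select the set of predictors that gives the highest fstatistic?

See the F-distribution table: http://www.socr.ucla.edu/applets.dir/f_table.html

The F-statistic is a value that depends on the degrees of freedom. It is always positive, because the distribution is one-sided.

But the model has not learned anything, because a trained model must have a high F-statistic (greater than or equal to the critical value at the given alpha, as it is phrased when testing the null hypothesis).

In all cases it does not exceed the critical value at alpha = 0.01, and you could set alpha to 0.0001, for example.

That said, I wanted to make sure (I did not study this at university) that adding noise variables does not make the linear model show more learning. As you can see...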
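To make the comparison with the critical value concrete, here is a small base-R sketch on a reduced version of the experiment. The values of n, k, n_models and alpha are assumptions chosen so that it runs in a few seconds. Under the null hypothesis, roughly a share alpha of the models is expected to land above the alpha-level critical value, so a handful of exceedances among many noise models is not a sign of learning.

```r
set.seed(1)
n <- 1000        # observations per model
k <- 5           # number of noise predictors (assumed value for the illustration)
n_models <- 500  # number of simulated models (assumed value)
alpha <- 0.01

# F-statistic of a no-intercept regression of pure noise on k noise predictors
f_stats <- replicate(n_models, {
  d <- as.data.frame(matrix(rnorm(n * (k + 1)), ncol = k + 1))
  names(d) <- c(paste0("V", 1:k), "y")
  summary(lm(y ~ . - 1, data = d))$fstatistic[["value"]]
})

# Critical value with the exact denominator degrees of freedom n - k
f_crit <- qf(1 - alpha, df1 = k, df2 = n - k)

cat("critical value:", f_crit, "\n")
cat("share of models above it:", mean(f_stats > f_crit), "\n")

# p-value of a single model's F-statistic
cat("p-value of the first model:", pf(f_stats[1], k, n - k, lower.tail = FALSE), "\n")
```

The script in the post uses df2 = 1000 rather than n - k; with about 1,000 observations per model the difference in the critical value is negligible.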


Vladimir Perervenko 2016.11.16 09:48 #2126

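Alexey Burnakov:

Useful code. It visualizes trade sequences in three views.

About the code above. Please write at least brief comments in the code, especially when you use complex expressions. Not everyone knows and uses the "data.table" package, so it would not hurt to explain what dcast.data.table and melt do, and what .N and .SD are. You do not post code to show how deeply you know it. In my opinion, published code should help other users (even those with only basic training) understand the script.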
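As an illustration of the idioms mentioned here, the following toy example walks through .N, dcast, .SD and melt on six trades. It is a sketch written for this explanation, with made-up data, and mirrors the steps of the script only in miniature.

```r
library(data.table)

# Toy table: 2 trade sequences ("variable") with 3 trades each
dt <- data.table(variable = rep(1:2, each = 3),
                 V1 = c(1, -2, 3, 0.5, 0.5, -1))

# .N is the number of rows in the current group, so 1:.N numbers the trades within each sequence
dt[, trade := 1:.N, by = variable]

# dcast: long -> wide, one row per sequence and one column per trade number
wide <- dcast(dt, variable ~ trade, value.var = "V1")

# .SD is the sub-table of the current group (all columns except the 'by' column);
# here it is a single row of trade results, turned into a running total
wide_cum <- wide[, as.list(cumsum(unlist(.SD))), by = variable]

# melt: wide -> long again, one row per (sequence, trade)
long_cum <- melt(wide_cum, id.vars = "variable",
                 variable.name = "trade", value.name = "V1")
setorder(long_cum, variable, trade)
print(long_cum)
```

The final table holds the cumulative profit of each sequence after each trade, which is the shape the plotting code expects.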

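R lets you program the same action in many ways, which is great, but it is better not to lose the readability of the code.

A few suggestions on the code.

- The intermediate variables x, x.cast and x_cum are not needed in the calculation and only take up memory. Any calculation that does not require saving intermediate results is better done through a pipe.

For example: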

#---variant-------------
rm(list=ls());gc()
library(data.table)
library(ggplot2)
library(gridExtra)
library(tseries)
#----
require(magrittr)
require(dplyr)
start <- Sys.time()
monte_trades <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%
        .[, variable := rep(1:1000, times = 1000)]%>%
        .[, trade := 1:.N, by = variable] %>%
        dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum)%>%
        .[, as.list(cumsum(unlist(.SD))), by = variable]%>%
        melt(., measure.vars = names(.)[-1], variable.name = "trade", value.name = 'V1')%>%
        setorder(., variable, trade)
monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])
quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]
RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]
Sys.time() - start
#Time difference of 2.247022 secs

Of course, building the charts takes a long time.

No criticism intended.

Good luck

Alexey Burnakov 2016.11.16 09:55 #2127

Dr.Trader:
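I don't fully understand what fstatistic means. The data here are random, yet the model has learned something, so one can conclude that the model is fitted and overtrained. That means the model's evaluation must be bad. In other words, I was expecting a negative fstatistic, or some other sign on the plot that things are bad.
How do I interpret the result of this example correctly?
My understanding is that the first predictor can be considered better quality than the first + second, and 1+2 is better than 1+2+3. Is that right? Does it make sense to use a genetic algorithm to select the set of predictors that gives the highest fstatistic?

And here is an example where we assume that the fully trained model includes 20 variables with increasing weights (variable 1 has weight 1, variable 20 has weight 20). Let's see how the distribution of the F-statistic changes as the predictors are added one after another.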

############### simulate lm f-stats with non-random vars
rm(list=ls());gc()
library(data.table)
library(ggplot2)
start <- Sys.time()
set.seed(1)
# 20 i.i.d. N(0,1) predictors V1..V20, 1,000,000 rows
x <- as.data.table(matrix(rnorm(20000000, 0, 1), ncol = 20))
# row-wise random weights coef1..coef20, where coef_i ~ N(i, 1)
x[, (paste0('coef', c(1:20))):= lapply(1:20, function(x) rnorm(.N, x, 1))]
# output = sum over i of V_i * coef_i
x[, output:= Reduce(`+`, Map(function(x, y) (x * y), .SD[, (1:20), with = FALSE], .SD[, (21:40), with = FALSE])), .SDcols = c(1:40)]
# randomly assign rows to 1,000 groups (about 1,000 observations each)
x[, sampling:= sample(1000, nrow(x), replace = T)]
# per group: F-statistic of the no-intercept model output ~ V1..Vk, k = 1..20 (column 41 of .SD is output)
lm_models <- x[,
{
lapply(c(1:20), function(x) summary(lm(data = .SD[, c(1:x, 41), with = F], formula = output ~ . -1))$'fstatistic'[[1]])
}
, by = sampling
]
lm_models_melted <- melt(lm_models, measure.vars = paste0('V', c(1:20)))
# critical values at alpha = 0.01 (99% quantiles of F(k, 1000))
critical_f_stats <- qf(p = 0.99, df1 = c(1:20), df2 = 1000, lower.tail = TRUE, log.p = FALSE)
boxplot(data = lm_models_melted, value ~ variable, log = 'y'); lines(critical_f_stats, type = 's', col = 'red')
# full 20-predictor model on a random sample of 1,000 rows
summary(lm(data = x[sample(1000000, 1000, replace = T), c(1:20, 41), with = F], formula = output ~ . -1))
Sys.time() - start
gc()

The plot uses a logarithmic Y axis.

Apparently, yes...

> summary(lm(data = x[sample(1000000, 1000, replace = T), c(1:20, 41), with = F], formula = output ~ . -1))

Call:
lm(formula = output ~ . - 1, data = x[sample(1e+06, 1000, replace = T),
    c(1:20, 41), with = F])

Residuals:
     Min       1Q   Median       3Q      Max
-19.6146  -2.8252   0.0192   3.0659  15.8853

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
V1    0.9528     0.1427   6.676  4.1e-11 ***
V2    1.7771     0.1382  12.859  < 2e-16 ***
V3    2.7344     0.1442  18.968  < 2e-16 ***
V4    4.0195     0.1419  28.325  < 2e-16 ***
V5    5.2817     0.1479  35.718  < 2e-16 ***
V6    6.2776     0.1509  41.594  < 2e-16 ***
V7    6.9771     0.1446  48.242  < 2e-16 ***
V8    7.9722     0.1469  54.260  < 2e-16 ***
V9    9.0349     0.1462  61.806  < 2e-16 ***
V10  10.1372     0.1496  67.766  < 2e-16 ***
V11  10.8783     0.1487  73.134  < 2e-16 ***
V12  11.9129     0.1446  82.386  < 2e-16 ***
V13  12.8079     0.1462  87.588  < 2e-16 ***
V14  14.2017     0.1487  95.490  < 2e-16 ***
V15  14.9080     0.1458 102.252  < 2e-16 ***
V16  15.9893     0.1428 111.958  < 2e-16 ***
V17  17.4997     0.1403 124.716  < 2e-16 ***
V18  17.8798     0.1448 123.470  < 2e-16 ***
V19  18.9317     0.1470 128.823  < 2e-16 ***
V20  20.1143     0.1466 137.191  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.581 on 980 degrees of freedom
Multiple R-squared: 0.9932, Adjusted R-squared: 0.993
F-statistic: 7123 on 20 and 980 DF, p-value: < 2.2e-16


Alexey Burnakov 2016.11.16 10:02 #2128

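Vladimir Perervenko:

Thank you! I did not know how to do that yet. And indeed, one should keep the calculations in memory as much as possible; it will be faster. Nice kung fu...

With your adjustments: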

#---variant-------------

rm(list=ls());gc()

library(data.table)

library(ggplot2)

library(gridExtra)

library(tseries)

#----

require(magrittr)

require(dplyr)

start <- Sys.time()

monte_trades <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%

.[, variable := rep(1:1000, times = 1000)]%>%

.[, trade := 1:.N, by = variable] %>%

dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum)%>%

.[, as.list(cumsum(unlist(.SD))), by = variable]%>%

melt(., measure.vars = names(.)[-1], variable.name = "trade", value.name = 'V1')%>%

setorder(., variable, trade)

monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])

quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]

RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]

p1 <- ggplot(data = monte_trades, aes(x = trade, y = V1, group = variable)) +

geom_line(size = 2, color = 'blue', alpha = 0.01) +

geom_line(data = quantile_trade, aes(x = trade, y = V1, group = 1), size = 2, alpha = 0.5, colour = 'blue') +

ggtitle('Simulated Trade Sequences of Length 1000')

p2 <- ggplot(data = monte_trades_last, aes(V1)) +

geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') +

scale_x_continuous(limits = c(min(monte_trades$V1), max(monte_trades$V1))) +

coord_flip() +

ggtitle('Cumulative Profit Density')

p3 <- ggplot(data = RF_last, aes(V1)) +

geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') +

geom_vline(xintercept = mean(RF_last$V1), colour = "blue", linetype = 2, size = 1) +

geom_vline(xintercept = median(RF_last$V1), colour = "red", linetype = 2, size = 1) +

ggtitle('Recovery Factor Density + Mean (blue) and Median (red)')

grid.arrange(p1, p2, p3, ncol = 3)

Sys.time() - start

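Run time is 47 seconds. I mean, the code is prettier and more compact, but there is no difference in speed... The plotting, yes, takes very long: 1,000 lines with transparency, that is why...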


Vladimir Perervenko 2016.11.16 10:59 #2129

Alexey Burnakov:

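Thank you! I did not know how to do that yet. And indeed, one should keep the calculations in memory as much as possible; it will be faster. Nice kung fu...

Run time is 47 seconds. I mean, the code is prettier and more compact, but there is no difference in speed... The plotting, yes, takes very long: 1,000 lines with transparency, that is why...

My calculation takes

# run time, seconds
#      min       lq     mean   median       uq      max neval
# 2.027561 2.253354 2.254134 2.275785 2.300051 2.610649   100

But that is not so important. The point was the readability of the code.

Good luck

PS. And parallelize the lm() calculations. This is exactly the case where it is needed.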
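The lm() fits in the F-statistic simulations are independent across groups, so they can be spread over several worker processes. Below is a minimal sketch using the base parallel package on a smaller toy data set. The choice of parLapply, the worker count of 2, the data sizes and the helper name fit_f are assumptions for illustration; the thread does not say which parallel approach was meant.

```r
library(parallel)

set.seed(1)
n_groups <- 8
d <- data.frame(matrix(rnorm(n_groups * 200 * 4), ncol = 4))  # columns X1..X4, pure noise
d$group <- rep(seq_len(n_groups), each = 200)

# One data frame per group; each is fitted independently
chunks <- split(d[, c("X1", "X2", "X3", "X4")], d$group)

# Hypothetical worker function: F-statistic of the no-intercept regression X4 ~ X1 + X2 + X3
fit_f <- function(chunk) {
  summary(lm(X4 ~ . - 1, data = chunk))$fstatistic[["value"]]
}

cl <- makeCluster(2)                            # 2 workers; tune to your machine
f_stats <- unlist(parLapply(cl, chunks, fit_f)) # fit the groups in parallel
stopCluster(cl)

print(f_stats)
```

For fits this small the cost of starting the workers can outweigh the gain, so whether this pays off for the full 20,000-model run would have to be measured.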

Alexey Burnakov 2016.11.16 11:50 #2130

Vladimir Perervenko:

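My calculation takes

# run time, seconds
#      min       lq     mean   median       uq      max neval
# 2.027561 2.253354 2.254134 2.275785 2.300051 2.610649   100

But that is not so important. The point was the readability of the code.

Good luck

PS. And parallelize the lm() calculations. This is exactly the case where it is needed.

That is not right. You timed only the part of the code before the charts. I reported the time with the charts included.

Before the charts I get 1.5 seconds. Your approach takes 1.15 seconds.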

rm(list=ls());gc()

library(data.table)

library(ggplot2)

library(gridExtra)

library(tseries)

start <- Sys.time()

set.seed(1)

x <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) #random normal value with positive expectation

x[, variable:= rep(1:1000, times = 1000)]

x[, trade:= 1:.N, by = variable]

x.cast = dcast.data.table(x, variable ~ trade, value.var = 'V1', fun.aggregate = sum)

x_cum <- x.cast[, as.list(cumsum(unlist(.SD))), by = variable]

monte_trades <- melt(x_cum, measure.vars = names(x_cum)[-1], variable.name = "trade", value.name = 'V1')

setorder(monte_trades, variable, trade)

monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])

quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]

RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]

Sys.time() - start

rm(list=ls());gc()

library(data.table)

library(ggplot2)

library(gridExtra)

library(tseries)

#----

require(magrittr)

require(dplyr)

start <- Sys.time()

monte_trades <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%

.[, variable := rep(1:1000, times = 1000)]%>%

.[, trade := 1:.N, by = variable] %>%

dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum)%>%

.[, as.list(cumsum(unlist(.SD))), by = variable]%>%

melt(., measure.vars = names(.)[-1], variable.name = "trade", value.name = 'V1')%>%

setorder(., variable, trade)

monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])

quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]

RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]

Sys.time() - start

It turns out yours is faster...
