Machine learning in trading: theory, models, practice and algo-trading - page 212

 
Renat Fatkhullin:

R is an amazing system, which personally opened my eyes to how far we in MetaTrader/MQL were from the real need "to make complex calculations simply and right now".

We (C++ developers) have the approach "you can do everything yourself; we give you the low-level base and computation speed" in our blood. We are fanatical about performance and we are good at it - MQL5 is great on 64 bits.

When I personally sat down to work in R, I realized that I needed as many powerful one-line functions as possible, and that I should be able to do research in general.

So we have made a sharp turn and started upgrading MetaTrader 5:

  • included the previously rewritten mathematical libraries Alglib and Fuzzy in the standard delivery, covered by unit tests
  • developed analogues of the statistical functions from R, ran tests and covered them with tests
  • developed the first beta version of the Graphics library as an analogue of plot in R, adding single-line functions for fast output
  • started changing the interfaces of the terminal output windows so that you can work with tabular data: changed the direction of output, added the ability to disable unneeded columns, and switched the Expert Advisor output window to a monospaced font
  • added a powerful ArrayPrint function for automatic printing of arrays, including arrays of structures
  • added FileLoad and FileSave functions for quick loading/saving of arrays to/from disk


Of course, we are only at the beginning of the road, but the right vector of effort is already clear.

Your motivation is great! If it is exactly as you say, you'll quickly gnaw Ninja, along with MultiCharts, down to the bone)))

However, IMHO, here you will have to create something radically new. That is, in addition to what you wrote, Mr. Reshetov, you need a research studio for working with arbitrary datasets, not only those downloadable from the market, because many things need to be tried on quite trivial, synthetic examples in order to understand what is going on - well, you should understand me, as one programmer to another)) You would need to draw various charts, scatterplots, heatmaps, distributions and so on. In general, it would be really cool if such a set of tools were available directly from the MetaEditor, but frankly I do not even hope ...

But in general, of course, I like the direction of your thinking))

 
Renat Fatkhullin:

It was a polite answer with no details or verification. And the answer did not match Wolfram Alpha and Matlab, which is a problem.

There is no need to go sideways - the root question has been clearly stated.

What do you mean, his answer didn't match Wolfram? Didn't match in the sense that the person's answer was not "zero"? The man replied that he does not think that at the zero point, where the integral = 0, the density must necessarily be zero (that is how I put the question to him). He said so explicitly. And he added that the density value at an individual point is irrelevant (I read "irrelevant" as not mattering for the question at hand). This is quite a clear mathematical statement.

In the question at hand, the mathematics is important.

We have the integral of a certain function (a gamma distribution probability density function). Everybody is used to the fact that you can give Wolfram an expression with parameters: specify the domain of integration and the function parameters, and it will integrate and return the answer. But have you ever thought that if you sat down and calculated this integral yourself over a given domain, you would get 0 over the point at zero, 1 over the whole support, and some value in [0,1] over a sub-interval? Simply by working out the integral!

The fact that the limit of the gamma distribution probability density function at the boundary point goes off somewhere into the positive region is a property of that function. It has nothing to do with what you get by integrating that function. That is what the man was writing about.
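For illustration, a minimal R sketch of this point (the shape parameter 0.5 is an assumed example value, not necessarily the exact case discussed here):

shape <- 0.5
pgamma(0, shape = shape, rate = 1)    # integral over the degenerate interval [0, 0]: 0
pgamma(0.5, shape = shape, rate = 1)  # integral over the sub-interval [0, 0.5]: a value in (0, 1)
pgamma(Inf, shape = shape, rate = 1)  # integral over the whole support: 1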

I'm not dodging the root issues. I will reiterate: our point has been validated by an independent person - the value of the density at zero does not matter (is "irrelevant").

 
Zhenya:

Your motivation is great! If it is exactly as you say, you'll quickly gnaw Ninja, along with MultiCharts, down to the bone)))

However, IMHO, here you will have to create something radically new. That is, in addition to what you wrote, Mr. Reshetov, you need a research studio for working with arbitrary datasets, not only those downloadable from the market, because many things need to be tried on quite trivial, synthetic examples in order to understand what is going on - well, you should understand me, as one programmer to another)) You would need to draw various charts, scatterplots, heatmaps, distributions and so on. In general, it would be really cool if such a set of tools were available directly from the MetaEditor, but frankly I do not even hope ...

But in general, of course, I like the direction of your thinking))

Are you referring to this "shot" by Reshetov?

"Some kind of rottenness this R - a bicycle with square wheels. What to talk about some of his packages, when the basis, i.e. the core in R is crooked and needs serious fine-tuning with a "pencil file"? What credibility can those who haven't even bothered to check the correctness of the basic functions in R for so many years have? What can be the "strength" in the weakness of R - the incorrectness of calculations through it?

Well, that MetaQuotes opened the eyes of some users to the fact that in fact this same R represents, the facts and tests with open source, so that everyone could independently double-check and make sure, but not groundless. Not all of course opened, because some religious fanatics from destructive sect of R will continue to blindly believe in "infallibility" of calculations in their crooked language and packages, instead of turning to the presented tests and double-check them independently, but not fanatically bullshitting, defending crookedness of R as "generally accepted standard".

Now it's quite obvious that it would be better to use MQL functionality for creating trading strategies, because the result will be more correct, than to try to do it with curve and slope of R.

Special thanks to MetaQuotes developers for the constructive approach, tests and their sources, as well as for the identification of the "naked king - R"! "

 
Vladimir Perervenko:

Is this Reshetov "shot" what you mean?

No, this is the message:

Yury Reshetov:

R, like many other languages, is for now more convenient for machine learning than MQL because it has purpose-built functionality for processing data in arrays. The thing is that a sample for machine learning is most often a two-dimensional array, so functionality for working with arrays is required:

  1. Inserting rows and columns, as arrays of smaller size, into another array
  2. Replacing rows and columns in an array with arrays of smaller size
  3. Deleting rows and columns from an array (for example, to remove low-value predictors or examples with obvious outliers from the sample)
  4. Splitting arrays into parts, producing two or more arrays that are parts of the original array (necessary for dividing a sample into training and test parts, or into more parts, for example for walk-forward testing)
  5. Random shuffling of rows and columns in an array with a uniform distribution (necessary so that particular examples from a sample fall into different parts, preferably evenly distributed over these parts)
  6. Various functions for processing data by individual rows or columns (for example, calculating the arithmetic mean of a row or column, the variance, or finding the maximum or minimum value in a row for subsequent normalization)
  7. And so on and so forth.

Until the aforementioned functionality necessary for handling samples in arrays is implemented in MQL, most developers of machine learning algorithms will prefer other languages that already have all of this available. Or they will use the unpretentious MLP (a 1960s algorithm) from AlgLib which, if I remember correctly, for convenience represents two-dimensional arrays as one-dimensional ones.

Of course, density functions of random distributions are also necessary functionality. But such functions are not always needed in machine learning tasks, and in some tasks they are not used at all. Operations on samples as multidimensional arrays, however, are something the implementation of machine learning algorithms always needs, for any task - unless of course it is the task of training a network on already-normalized data from trivial COR.
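For context, a minimal R sketch (not from the original post) of how several of the items listed above look in practice; the object names dt, train, test, dt2 and the toy 100 x 10 sample are made up for the illustration:

library(data.table)

set.seed(42)
dt <- as.data.table(matrix(rnorm(100 * 10), ncol = 10))    # a toy 100 x 10 sample

dt[, V10 := NULL]                                   # (3) delete a column (e.g. a weak predictor)
dt <- dt[-c(1, 5)]                                  # (3) delete rows (e.g. obvious outliers)

idx   <- sample(nrow(dt), floor(0.7 * nrow(dt)))    # (5) uniformly shuffled row indices
train <- dt[idx]                                    # (4) split into a training part ...
test  <- dt[-idx]                                   # ... and a test part
dt2   <- rbind(train, test)                         # (1) stack rows from another array

colMeans(train)                                     # (6) per-column means
apply(train, 1, max)                                # (6) per-row maxima (e.g. for normalization)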

 
Vladimir Perervenko:

Is this Reshetov "shot" what you mean?

"This R is rotten, a bicycle with square wheels. What to speak about some of his packages, when the very basis, i.e. the core in R is crooked and needs serious revision "with a pencil file"? What credibility can those who haven't even bothered to check the correctness of the basic functions in R for so many years have? What can be the "strength" in the weakness of R - the incorrectness of calculations through it?

Well, that MetaQuotes opened the eyes of some users to the fact that in fact this same R represents, the facts and tests with open source, so that everyone could independently double-check and make sure, but not groundless. Not all of course opened, because some religious fanatics from destructive sect of R will continue to blindly believe in "infallibility" of calculations in their crooked language and packages, instead of turning to the presented tests and double-check them independently, instead of fanatically bullshitting, defending crookedness of R as "generally accepted standard".

Now it's quite obvious that it would be better to use MQL functionality to create trading strategies, because the result will be more correct, than to try to do it with curve and slope of R.

Special thanks to MetaQuotes developers for the constructive approach, tests and their sources, as well as for the identification of the "Naked King - R"!

Have you already deleted your post about "minky MQL"? You are scrubbing your posts the same way certain figures scrubbed their Facebook pages after Trump was elected.

 

Here is an example of the gamma distribution in Wolfram Alpha, just for fun.

It is given a function - a slightly simplified gamma distribution density function.

The point is the x in the denominator. As you can see, Wolfram evaluates the right-hand limit as x->0 correctly: inf.

That is, the right-hand limit of the density at the zero point is infinity (which is exactly the answer dgamma gives).

Let's integrate this function over a large part of the support:

The integral is 1 (rounded, of course, because the full support is not taken).

Conclusion: despite the fact that the function goes to infinity at the boundary point, the integral of this function is computed just fine, as it should be.
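The same check can be reproduced numerically in R (a minimal sketch; shape = 0.5 is an assumed example value, not necessarily the exact parameters used in the Wolfram query):

shape <- 0.5
dgamma(1e-9, shape = shape, rate = 1)              # the density blows up as x -> 0+ (dgamma(0, ...) is Inf for shape < 1)
integrate(dgamma, 0, 30, shape = shape, rate = 1)  # approximately 1: the integral converges despite the singularity at 0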

 
Alexey Burnakov:

Here is an example of the gamma distribution in Wolfram Alpha for your amusement.

The conclusion is that even though the function goes to infinity at the boundary point, the integral of that function is computed just fine.

Thanks for the example, you are right. This integral is convergent.

The limiting value at the point x=0 can also be used to define the density, and it will not lead to divergence.

 
Quantum:

Thanks for the example, you are correct. That integral is convergent.

The limiting value at the point x=0 can also be used to define the density, and this will not lead to divergence.


Thank you! Respect.

 

An example in R using fast data-processing packages.

library(data.table)
library(ggplot2)

start <- Sys.time()

set.seed(1)
dummy_coef <- 1:9

# 1,000,000 observations of 9 standard-normal predictors
x <- as.data.table(matrix(rnorm(9000000, 0, 1), ncol = 9))

# 9 noisy coefficient columns, centred at 1..9
x[, (paste0('coef', c(1:9))) := lapply(1:9, function(x) rnorm(.N, x, 1))]

# check: the coefficient columns have means of roughly 1..9
print(colMeans(x[, c(10:18), with = F]))

# output = sum over i of predictor_i * coef_i
x[, output := Reduce(`+`, Map(function(x, y) (x * y),
                              .SD[, (1:9), with = FALSE],
                              .SD[, (10:18), with = FALSE])), .SDcols = c(1:18)]

# split the rows into 1000 random subsamples
x[, sampling := sample(1000, nrow(x), replace = T)]

# fit a no-intercept linear model on each subsample;
# the 9 coefficients per model are stacked into column V1
lm_models <- x[,
               {
                 lm(data = .SD[, c(1:9, 19), with = F], formula = output ~ . - 1)$coefficients
               },
               by = sampling]

lm_models[, coefs := rep(1:9, times = 1000)]

# average estimated coefficients vs. the true values 1..9
avg_coefs <- lm_models[, mean(V1), by = coefs]
plot(dummy_coef, avg_coefs$V1)

# Shapiro-Wilk normality test for each coefficient's sampling distribution
lm_models[, print(shapiro.test(V1)$p.value), by = coefs]

ggplot(data = lm_models, aes(x = V1)) +
  geom_histogram(binwidth = 0.05) +
  facet_wrap(~ coefs, ncol = 3)

Sys.time() - start

Running time: 5 seconds. 1,000 linear models were constructed, each on 1,000 observations.

[1] 0.8908975
[1] 0.9146406
[1] 0.3111422
[1] 0.02741917
[1] 0.9824953
[1] 0.3194611
[1] 0.606778
[1] 0.08360257
[1] 0.4843107

All coefficients are normally distributed.

And a ggplot for visualization.

 

And another example. It is also about simulating statistics on large samples.

########## simulate difference between means: density with different sample sizes

library(data.table)
library(ggplot2)

rm(list = ls()); gc()

start <- Sys.time()

# 10,000,000 pairs of independent standard-normal values
x <- rnorm(10000000, 0, 1)
y <- rnorm(10000000, 0, 1)

dat <- as.data.table(cbind(x, y))

# three splittings: into 100, 1,000 and 10,000 random subsamples
dat[, (paste0('sampling_', c(100, 1000, 10000))) :=
      lapply(c(100, 1000, 10000), function(x) sample(x, nrow(dat), replace = T))]

dat_melted <- melt(dat, measure.vars = paste0('sampling_', c(100, 1000, 10000)))

# difference between the means of x and y within each subsample
critical_t <- dat_melted[,
                         {
                           mean(x) - mean(y)
                         },
                         by = .(variable, value)]

# overlay the densities of the mean differences for the three sample sizes
ggplot(critical_t, aes(x = V1, group = variable, fill = variable)) +
  stat_density(alpha = 0.5)

Sys.time() - start

gc()

Running time is 3.4 seconds.

Normally distributed samples centred at zero are created and split in three ways:

sampling_10000 - 10,000 samples of about 1,000 pairs of values each;

sampling_1000 - 1,000 samples of about 10,000 pairs each;

sampling_100 - 100 samples of about 100,000 pairs each.

The difference between the means (expected value == 0) is computed for each sample.

The densities of the distributions of these sample mean differences are then plotted for the different sample sizes.

Here sampling_100 means the sample size is 10,000,000 / 100 = 100,000 pairs. That is, for smaller samples the standard error is larger...
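A small follow-up sketch (not from the original post), assuming the critical_t table from the example above is still in memory; the column name empirical_se is made up for the illustration:

critical_t[, .(empirical_se = sd(V1)), by = variable]
# expected pattern: sampling_10000 (samples of ~1,000 pairs) shows the largest spread,
# sampling_100 (samples of ~100,000 pairs) the smallest - the standard error
# shrinks roughly as 1/sqrt(sample size)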