Machine learning in trading: theory, models, practice and algo-trading - page 3256

fxsaber #:

It must be some peculiarities of Python, because the algorithm is the same in MQL.

  1. We run through the 1d-array of Pos-variables.
  2. [Pos-n, Pos] - another pattern.
  3. We applied something like a correlation to this pattern and the 1d-array, giving corr[].
  4. We found situations where MathAbs(corr[i]) > 0.9.
  5. In these places we looked at m bars ahead of the price behaviour and averaged it.
  6. Found many such places and they averaged out nicely? Then we saved the pattern data (the values from step 2).
  7. Pos++ and back to step 2.
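
The seven steps above can be sketched in Python as follows. This is a minimal sketch, not the poster's code: the function names, the `min_matches` rule for "a lot of places", and taking the pattern as the n bars before Pos are all assumptions made for illustration. Whether the average looks "nice" is left to the caller.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def corr_with_windows(series, pattern):
    """Pearson correlation of `pattern` with every same-length window of `series`."""
    windows = sliding_window_view(series, len(pattern))   # a view, no copy yet
    w = windows - windows.mean(axis=1, keepdims=True)     # centred copy of all windows
    p = pattern - pattern.mean()
    denom = np.sqrt((w * w).sum(axis=1) * (p * p).sum())
    corr = np.zeros(len(w))
    np.divide(w @ p, denom, out=corr, where=denom > 0)    # flat windows keep corr = 0
    return corr

def scan_patterns(series, n, m, threshold=0.9, min_matches=10):
    """Brute-force scan: match each n-bar pattern against history and
    average the price change m bars after every match."""
    series = np.asarray(series, dtype=float)
    history = series[:len(series) - m]        # every window here has m bars after it
    saved = []
    for pos in range(n, len(series) + 1):     # steps 1 and 7
        pattern = series[pos - n:pos]         # step 2 (assumed: the n bars before pos)
        corr = corr_with_windows(history, pattern)         # step 3
        hits = np.flatnonzero(np.abs(corr) > threshold)    # step 4
        last = hits + n - 1                   # last bar of each matching window
        moves = series[last + m] - series[last]            # step 5
        if len(hits) >= min_matches:          # step 6 (hypothetical acceptance rule)
            saved.append((pos, pattern.copy(), float(moves.mean())))
    return saved
```

Note that a pattern also matches its own position in the history with a correlation of 1, so a real implementation would probably exclude that hit.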

This is the brute-force variant. A sieve would be even faster.

Let's assume one million bars and a row length of 10. Then the 1d-array of 10 million double values is 80 MB. As for step 3, well, let it be 500 MB in terms of memory consumption. What haven't I taken into account?
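
The 80 MB figure can be checked from the NumPy side. The sketch below is scaled down tenfold (100,000 bars instead of one million) and uses placeholder prices. It assumes the rows are built with `sliding_window_view`, which is one possible way to do it and not necessarily what was used in the thread.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

bars, row_len = 100_000, 10                 # 10x smaller than the post's one million bars
series = np.zeros(bars + row_len - 1)       # placeholder prices
windows = sliding_window_view(series, row_len)   # all rows of length 10

shares = np.shares_memory(windows, series)  # the rows are a view: no extra memory so far
copied = windows - windows.mean(axis=1, keepdims=True)   # any arithmetic makes a full copy

print(windows.shape, shares, copied.nbytes)  # (100000, 10) True 8000000
```

At full scale each such temporary is 80 MB, so a handful of intermediate arrays in step 3 would be in line with the rough 500 MB estimate.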

The correlation matrix of all rows to all rows is computed many times faster than with nested loops (1 row to each other row) or even a single loop (1 row to all rows). There's some kind of acceleration there due to the algorithm. I checked it on the ALGLIB version of the correlation calculation.
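
One plausible reason for the speed-up, offered as an assumption and not checked against the ALGLIB source: once every row is centred and scaled to unit length, the whole correlation matrix is a single matrix product, which avoids repeating the means and norms for each pair. A minimal NumPy sketch showing that both routes give the same numbers (the function names are made up for illustration):

```python
import numpy as np

def corr_loop(rows):
    """Pairwise route: one correlation call per pair of rows."""
    n = len(rows)
    out = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            out[i, j] = out[j, i] = np.corrcoef(rows[i], rows[j])[0, 1]
    return out

def corr_matmul(rows):
    """Matrix route: standardise each row once, then one matrix product."""
    z = rows - rows.mean(axis=1, keepdims=True)
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # assumes no constant rows
    return z @ z.T

rng = np.random.default_rng(0)
rows = rng.normal(size=(50, 20))
same = np.allclose(corr_loop(rows), corr_matmul(rows))
```
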
Forester #:
Give me the code, let's check it.

fxsaber #:

  1. We found situations where MathAbs(corr[i]) > 0.9.

MathAbs() seems unnecessary to me.

fxsaber #:

I was baffled myself that no library in Python can calculate it, so I ended up confused.

Pandas overflows the RAM; the overhead is gigantic.

NumPy just crashes and kills the interpreter session :) without displaying any errors.

You could make a sieve, but you'd have to rewrite the whole code.
fxsaber #:

Give me the code, let's check it out.

in statistics.mqh.

PearsonCorrM - Correlation of all rows to all rows is the fastest.

//| Pearson product-moment correlation matrix                        |
//| INPUT PARAMETERS:                                                |
//|     X   -   array[N,M], sample matrix:                           |
//|             * J-th column corresponds to J-th variable           |
//|             * I-th row corresponds to I-th observation           |
//|     N   -   N>=0, number of observations:                        |
//|             * if given, only leading N rows of X are used        |
//|             * if not given, automatically determined from input  |
//|               size                                               |
//|     M   -   M>0, number of variables:                            |
//|             * if given, only leading M columns of X are used     |
//|             * if not given, automatically determined from input  |
//|               size                                               |
//| OUTPUT PARAMETERS:                                               |
//|     C   -   array[M,M], correlation matrix (zero if N=0 or N=1)  |
static bool CBaseStat::PearsonCorrM(const CMatrixDouble &cx,const int n,
                                    const int m,CMatrixDouble &c)
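
The comment block above says that columns are the variables and rows are the observations, so the orientation of the input matters. The NumPy sketch below illustrates the same distinction with `np.corrcoef` and its `rowvar` parameter. It shows NumPy's behaviour only; what ALGLIB returns for constant columns is not checked here.

```python
import numpy as np

# A matrix whose three rows are identical.
x = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 3.0],
              [1.0, 2.0, 3.0]])

# Rows as variables (NumPy's default): every row is the same ramp,
# so all pairwise correlations are 1.
by_rows = np.corrcoef(x)

# Columns as variables, the orientation described in the PearsonCorrM comment.
# Each column is constant here, so its variance is 0 and NumPy returns NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    by_cols = np.corrcoef(x, rowvar=False)
```

To correlate rows with a routine that treats columns as variables, the matrix would have to be transposed first.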

PearsonCorr2 - correlation of one row with another. For a full matrix: the 1st row is checked against all rows after it, the 2nd row against all rows after the 2nd, and so on.

//| Pearson product-moment correlation coefficient                   |
//| Input parameters:                                                |
//|     X       -   sample 1 (array indexes: [0..N-1])               |
//|     Y       -   sample 2 (array indexes: [0..N-1])               |
//|     N       -   N>=0, sample size:                               |
//|                 * if given, only N leading elements of X/Y are   |
//|                   processed                                      |
//|                 * if not given, automatically determined from    |
//|                   input sizes                                    |
//| Result:                                                          |
//|     Pearson product-moment correlation coefficient               |
//|     (zero for N=0 or N=1)                                        |
static double CBaseStat::PearsonCorr2(const double &cx[],const double &cy[],
                                      const int n)

And with PearsonCorrM2 you can put the full matrix into one matrix and the row to be checked into another, so you can check 1 row against all rows at once. But this obviously does unnecessary work, because for the 10th row the correlations with the rows above it have already been calculated.

static bool CBaseStat::PearsonCorrM2(const CMatrixDouble &cx,const CMatrixDouble &cy,
                                     const int n,const int m1,const int m2,
                                     CMatrixDouble &c)
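
The redundancy of the "one row against all rows" approach can be counted in a small NumPy sketch. The helper name and the sizes are assumptions for illustration; this is not the ALGLIB routine.

```python
import numpy as np

def standardize(rows):
    """Centre each row and scale it to unit length (assumes no constant rows)."""
    z = rows - rows.mean(axis=1, keepdims=True)
    return z / np.linalg.norm(z, axis=1, keepdims=True)

rng = np.random.default_rng(1)
z = standardize(rng.normal(size=(6, 20)))
k = 2

vs_all = z @ z[k]            # row k against every row, including itself and earlier rows
vs_later = z[k + 1:] @ z[k]  # row k against the rows after it only

n = len(z)
all_calls = n * n                  # coefficients computed by "one row vs all" for every k
unique_pairs = n * (n - 1) // 2    # coefficients that are actually distinct
```

For 6 rows that is 36 computed values against 15 distinct pairs, so a little over half of the work is repeated or trivial.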

Check it on a matrix of about 5k*20k. With 100*100 it will be fast.
In NumPy, a 20k*20k matrix weighs 2 GB.
Maxim Dmitrievsky #:
A 20k*20k Numpy matrix weighs 2gb.

400 million double numbers weigh 3 gigs.
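
The arithmetic behind the 3 gigs figure, as a short check. The helper name is made up for illustration, and 8 bytes per value is the size of a double.

```python
def matrix_bytes(rows, cols, bytes_per_value=8):
    """Memory needed for a dense rows x cols matrix of doubles."""
    return rows * cols * bytes_per_value

size = matrix_bytes(20_000, 20_000)   # 400 million doubles
print(size / 1e9)     # 3.2 (decimal gigabytes)
print(size / 2**30)   # about 2.98 (binary gigabytes, GiB)
```

This counts only the result matrix. Any temporary copy made during the calculation adds the same amount again.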

mytarmailS #:

MathAbs () seems redundant to me.

You can also check the signs separately. Not the point.

fxsaber #:

400 million double numbers weigh 3 gigabytes.

It's understandable, there's not enough memory for all this joy.
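
One way around the memory limit, sketched here as an assumption and not taken from the thread: compute the correlation matrix one block of rows at a time and keep only the pairs above the threshold, so the full matrix never exists in memory. The function name and the block size are hypothetical.

```python
import numpy as np

def high_corr_pairs(rows, threshold=0.9, block=1024):
    """Yield (i, j, r) for row pairs i < j with |r| > threshold,
    holding only a (block x n) slice of the correlation matrix at a time."""
    z = rows - rows.mean(axis=1, keepdims=True)
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # assumes no constant rows
    n = len(z)
    for start in range(0, n, block):
        stop = min(start + block, n)
        c = z[start:stop] @ z.T                     # slice of the full matrix
        ii, jj = np.nonzero(np.abs(c) > threshold)
        for i, j in zip(ii + start, jj):
            if i < j:                               # skip the diagonal and mirrored pairs
                yield int(i), int(j), float(c[i - start, j])
```

With 20k rows and a block of 1024, each slice holds 1024 * 20,000 doubles, which is about 164 MB instead of 3.2 GB.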

Forester #:
in statistics.mqh.

PearsonCorrM - Correlation of all rows to all rows is the fastest.

I've got something wrong somewhere, but I don't see it.
#include <Math\Alglib\statistics.mqh> 

void OnStart()
  const matrix<double> matrix1 = {{1, 2, 3}, {1, 2, 3}, {1, 2, 3}};
  const CMatrixDouble Matrix1(matrix1);
  CMatrixDouble Matrix2;
  if (CBaseStat::PearsonCorrM(Matrix1, 3, 3, Matrix2))  