Machine learning in trading: theory, models, practice and algo-trading - page 3256

 
fxsaber #:

It must be some peculiarities of Python, because the algorithm is the same in MQL.

  1. We run Pos over the 1d-array.
  2. [Pos-n, Pos] is the next pattern.
  3. We apply something correlation-like to this pattern and the 1d-array, getting corr[].
  4. We find the places where MathAbs(corr[i]) > 0.9.
  5. At these places we look at the price behaviour m bars ahead and average it.
  6. Found many places and they average out nicely? - save the pattern data (the values from step 2).
  7. Pos++ and back to step 2.

This is the head-on (brute-force) variant. The sieve is even faster.
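The seven steps above can be sketched in Python/NumPy. This is a hedged sketch of the brute-force variant, not the author's code; the names `frontal_scan`, `n`, `m` and `thresh` are mine:

```python
import numpy as np

def frontal_scan(prices, n=10, m=5, thresh=0.9):
    """For each position Pos, take the pattern prices[Pos-n:Pos],
    correlate it with every other length-n window, and where
    |corr| > thresh average the next m bars of price behaviour."""
    results = {}
    windows = np.lib.stride_tricks.sliding_window_view(prices, n)
    # centre the windows once so each correlation is a single dot product
    w = windows - windows.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(w, axis=1)
    valid = norms > 0                       # skip constant windows
    for pos in range(n, len(prices) - m):
        pattern, pn = w[pos - n], norms[pos - n]
        if pn == 0:
            continue
        corr = np.zeros(len(w))
        corr[valid] = w[valid] @ pattern / (norms[valid] * pn)
        hits = np.where(np.abs(corr) > thresh)[0]
        hits = hits[hits + n + m <= len(prices)]   # need m bars ahead
        if len(hits) >= 2:                  # more than the trivial self-match
            ahead = [prices[h + n : h + n + m].mean() for h in hits]
            results[pos] = float(np.mean(ahead))
    return results
```

On a repetitive series (e.g. a sine wave) this finds many matching windows per position; on real prices the hit counts drive step 6.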


Let's assume one million bars and a row length of 10. Then the 1d-array of 10 million double values (step 3) is 80 MB. Well, call it 500 MB in terms of memory consumption. What haven't I taken into account?
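A quick arithmetic check of that estimate (assuming 8-byte doubles). The 80 MB figure holds for the input rows; what a full all-rows-to-all-rows approach would add is the N×N correlation matrix itself, which is on a different scale entirely:

```python
bars = 1_000_000           # one million bars
pattern_len = 10           # row/pattern length from the post

window_matrix = bars * pattern_len * 8   # all length-10 rows as doubles
full_corr_matrix = bars * bars * 8       # all rows vs all rows

print(window_matrix)       # 80_000_000        -> 80 MB, matches the post
print(full_corr_matrix)    # 8_000_000_000_000 -> ~8 TB
```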

The correlation of a matrix of all rows against all rows is computed many times faster than pairwise loops (one row against each subsequent row) or even a single loop (one row against all rows at once). There is some acceleration built into the algorithm itself. I checked it on the alglib implementation of the correlation calculation.
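The same effect is easy to reproduce in NumPy (a sketch, not the alglib test itself): the matrix formulation reduces the whole job to one centring pass and one matrix multiplication, instead of thousands of small per-pair calls:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 50))        # 120 rows, 50 samples each

# One call: all rows vs all rows (one big matmul inside np.corrcoef).
t0 = time.perf_counter()
C_matrix = np.corrcoef(X)
t_matrix = time.perf_counter() - t0

# Pairwise loop: one small corrcoef call per pair of rows.
t0 = time.perf_counter()
C_loop = np.empty((120, 120))
for i in range(120):
    for j in range(120):
        C_loop[i, j] = np.corrcoef(X[i], X[j])[0, 1]
t_loop = time.perf_counter() - t0

assert np.allclose(C_matrix, C_loop)      # same numbers, very different speed
```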
 
Forester #:
The correlation of a matrix of all rows against all rows is computed many times faster than pairwise loops (one row against each subsequent row) or even a single loop (one row against all rows at once). There is some acceleration built into the algorithm itself. I checked it on the alglib implementation of the correlation calculation.

Give me the code, let's check it.

 
fxsaber #:


  1. We found situations where MathAbs(corr[i]) > 0.9.

MathAbs() seems unnecessary to me

 
fxsaber #:

It must be some peculiarities of Python, because the algorithm is the same in MQL.

  1. We run Pos over the 1d-array.
  2. [Pos-n, Pos] is the next pattern.
  3. We apply something correlation-like to this pattern and the 1d-array, getting corr[].
  4. We find the places where MathAbs(corr[i]) > 0.9.
  5. At these places we look at the price behaviour m bars ahead and average it.
  6. Found many places and they average out nicely? - save the pattern data (the values from step 2).
  7. Pos++ and back to step 2.

This is the head-on (brute-force) variant. The sieve is even faster.


Let's assume one million bars and a row length of 10. Then the 1d-array of 10 million double values (step 3) is 80 MB. Well, call it 500 MB in terms of memory consumption. What haven't I taken into account?

It baffled me too that no Python library can calculate this; that's what confused me.

Pandas overflows the RAM; the overhead is gigantic.

NumPy just crashes and kills the interpreter session :) without showing any errors.

A sieve could be done, but the whole code would have to be rewritten.
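For what it's worth, one way to sidestep the RAM blow-up without a full rewrite (a sketch, not the sieve mentioned above; `corr_hits_chunked` is my name): normalize the rows once, then compute the correlation matrix in chunk-sized slabs, keeping only the index pairs that pass the threshold instead of the whole N×N matrix:

```python
import numpy as np

def corr_hits_chunked(W, thresh=0.9, chunk=1024):
    """Find all row pairs (i, j), i < j, with |corr| > thresh,
    holding at most a chunk x N slab of the correlation matrix."""
    Wc = W - W.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(Wc, axis=1)
    norms[norms == 0] = np.inf            # constant rows -> corr treated as 0
    Wn = Wc / norms[:, None]              # unit rows: corr is a dot product
    hits = []
    for start in range(0, len(Wn), chunk):
        block = Wn[start:start + chunk] @ Wn.T      # chunk x N slab only
        i, j = np.where(np.abs(block) > thresh)
        keep = (i + start) < j                      # upper triangle, no self
        hits.extend(zip((i + start)[keep].tolist(), j[keep].tolist()))
    return hits
```

Peak memory is `chunk * N` doubles per slab instead of `N * N` for the full matrix, at the cost of a Python loop over the chunks.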
 
fxsaber #:

Give me the code, let's check it out.

In statistics.mqh.

Functions:
PearsonCorrM - correlation of all rows against all rows; the fastest one.

//+------------------------------------------------------------------+
//| Pearson product-moment correlation matrix                        |
//| INPUT PARAMETERS:                                                |
//|     X   -   array[N,M], sample matrix:                           |
//|             * J-th column corresponds to J-th variable           |
//|             * I-th row corresponds to I-th observation           |
//|     N   -   N>=0, number of observations:                        |
//|             * if given, only leading N rows of X are used        |
//|             * if not given, automatically determined from input  |
//|               size                                               |
//|     M   -   M>0, number of variables:                            |
//|             * if given, only leading M columns of X are used     |
//|             * if not given, automatically determined from input  |
//|               size                                               |
//| OUTPUT PARAMETERS:                                               |
//|     C   -   array[M,M], correlation matrix (zero if N=0 or N=1)  |
//+------------------------------------------------------------------+
static bool CBaseStat::PearsonCorrM(const CMatrixDouble &cx,const int n,
                                    const int m,CMatrixDouble &c)

PearsonCorr2 - correlation of one row against another. For a full matrix: the 1st row is checked against all rows after the 1st, the 2nd against all rows after the 2nd, and so on.

//+------------------------------------------------------------------+
//| Pearson product-moment correlation coefficient                   |
//| Input parameters:                                                |
//|     X       -   sample 1 (array indexes: [0..N-1])               |
//|     Y       -   sample 2 (array indexes: [0..N-1])               |
//|     N       -   N>=0, sample size:                               |
//|                 * if given, only N leading elements of X/Y are   |
//|                   processed                                      |
//|                 * if not given, automatically determined from    |
//|                   input sizes                                    |
//| Result:                                                          |
//|     Pearson product-moment correlation coefficient               |
//|     (zero for N=0 or N=1)                                        |
//+------------------------------------------------------------------+
static double CBaseStat::PearsonCorr2(const double &cx[],const double &cy[],
                                      const int n)
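The PearsonCorr2-based scheme described above (row 1 against all later rows, etc.) looks like this in Python/NumPy terms (a sketch; `corr_upper_triangle` is my name):

```python
import numpy as np

def corr_upper_triangle(W):
    """Pairwise scheme from the post: row i is correlated only with
    rows j > i, so each pair is computed exactly once, then mirrored."""
    n = len(W)
    C = np.eye(n)                         # diagonal is 1 by definition
    for i in range(n):
        for j in range(i + 1, n):
            C[i, j] = C[j, i] = np.corrcoef(W[i], W[j])[0, 1]
    return C
```

This does half the pairwise work of a full double loop, but each pair is still a separate small call, which is exactly what the matrix version avoids.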



And via PearsonCorrM2 you can pass the full matrix as one input and the row to be checked as the other, so one row can be checked against all rows at once. But there is obviously redundant work: by the time you reach the 10th row, its correlations with the rows above it have already been calculated.

static bool CBaseStat::PearsonCorrM2(const CMatrixDouble &cx,const CMatrixDouble &cy,
                                     const int n,const int m1,const int m2,
                                     CMatrixDouble &c)


Test it on a matrix of about 5k*20k; at 100*100 everything will be fast anyway.
 
In NumPy, a 20k*20k matrix weighs 2 GB.
 
Maxim Dmitrievsky #:
In NumPy, a 20k*20k matrix weighs 2 GB.

400 million double numbers weigh 3 gigs.

 
mytarmailS #:

MathAbs() seems redundant to me.

You can also check the signs separately. Not the point.

 
fxsaber #:

400 million double numbers weigh 3 gigabytes.

It's understandable, there's not enough memory for all this joy.

 
Forester #:
in statistics.mqh.

functions
PearsonCorrM - Correlation of all rows to all rows is the fastest.

Something is wrong somewhere, but I can't see where.
#include <Math\Alglib\statistics.mqh> 

void OnStart()
{
  const matrix<double> matrix1 = {{1, 2, 3}, {1, 2, 3}, {1, 2, 3}};
  
  const CMatrixDouble Matrix1(matrix1);
  CMatrixDouble Matrix2;
    
  if (CBaseStat::PearsonCorrM(Matrix1, 3, 3, Matrix2))  
    Print(Matrix2.ToMatrix());
}
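For whatever it's worth, a NumPy stand-in suggests one possible culprit. The PearsonCorrM header above says the J-th column is the J-th variable, and in this test matrix (three identical rows) every column is constant, so the Pearson coefficient is 0/0 and the result is undefined:

```python
import numpy as np

# Same 3x3 input as the MQL script above: three identical rows.
a = np.array([[1, 2, 3],
              [1, 2, 3],
              [1, 2, 3]], dtype=float)

# Columns are the variables; each column has zero variance, so the
# correlation is 0/0 -> NaN in NumPy (suppress the runtime warnings).
with np.errstate(invalid="ignore", divide="ignore"):
    c = np.corrcoef(a, rowvar=False)

assert np.isnan(c).all()
```

Whether alglib reports this degenerate case the same way I can't say; but a matrix with distinct rows and non-constant columns would be a cleaner test input.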