Help write a linear regression - page 5

 
Rosh wrote >>

And this is what Excel 2007 gives:



So Mathcad may need to be checked.

If I understand correctly, Excel gives a third result, different from the first two. Where is the truth? B is the same as in Mathcad, but coefficient A is different.

 
Prival wrote >>

If I understand correctly, Excel gives a third result, different from the first two. Where is the truth? B is the same as in Mathcad, but coefficient A is different.

In general, with this many significant digits and such a range, the calculation takes place somewhere at the far end of the mantissa. That means even an element of randomness can creep into the answer. I think correctness can only be expected from some special high-accuracy algorithm. In this case it is better to shift the origin of coordinates closer to the X range.



P.S. It is worst when the sum of X*X is calculated: there the information simply goes down the drain :)
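To illustrate the suggested shift of the origin, here is a minimal Java sketch. It keeps the same sum-based formulas that appear later in the thread, but subtracts X[0] from every X first and converts the intercept back at the end. The class name `ShiftedRegression`, the method `fit`, and the choice of X[0] as the new origin are assumptions made for this illustration only.

```java
// Sketch: sum-based linear regression on X values shifted by X[0].
// ShiftedRegression and fit are hypothetical names, not from the thread.
class ShiftedRegression {
    // Returns {intercept, slope} of y = intercept + slope * x
    // expressed in the ORIGINAL (unshifted) coordinates.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double x0 = x[0];                 // assumed choice of the new origin
        double sumX = 0.0, sumY = 0.0, sumXY = 0.0, sumX2 = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - x0;        // small values instead of ~1.2e9
            sumX  += dx;
            sumY  += y[i];
            sumXY += dx * y[i];
            sumX2 += dx * dx;
        }
        double slope = (sumXY * n - sumX * sumY) / (sumX2 * n - sumX * sumX);
        double interceptShifted = (sumY - sumX * slope) / n;
        // Undo the shift: y = a' + b * (x - x0) = (a' - b * x0) + b * x
        double intercept = interceptShifted - slope * x0;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        double[] x = {1216640160, 1216640100, 1216640040,
                      1216639980, 1216639920, 1216639860};
        double[] y = {1.9971, 1.9970, 1.9967, 1.9969, 1.9968, 1.9968};
        double[] r = fit(x, y);
        System.out.println("intercept = " + r[0] + " slope = " + r[1]);
    }
}
```

The slope does not depend on the shift, so only the intercept needs converting back. That last step still multiplies the slope by a value near 1.2e9, so the intercept carries fewer reliable digits than the slope.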

 
lna01 wrote >>

In general, with this many significant digits and such a range, the calculation takes place somewhere at the far end of the mantissa. That means even an element of randomness can creep into the answer. I think correctness can only be expected from some special high-accuracy algorithm. In this case it is better to shift the origin of coordinates closer to the X range.

The thing is, I've started preparing for the championship and began porting my Mathcad developments to MQL. If you remember, we were building the ACF (autocorrelation function); I started with it and decided to compute it with direct formulas, since Fourier transforms load the CPU too heavily.

That is how I ended up looking for the source of the problem :(

I will try to shift X (Time) to 0. But then I will have to recheck everything again. As it is, I already have to give up about 50% of my ideas.

 
Prival wrote >>

The thing is, I've started preparing for the championship and began porting my Mathcad developments to MQL. If you remember, we were building the ACF (autocorrelation function); I started with it and decided to compute it with direct formulas, since Fourier transforms load the CPU too heavily.

That is how I ended up looking for the source of the problem :(

I will try to shift X (Time) to 0. But then I will have to recheck everything again. As it is, I already have to give up about 50% of my ideas.

MT keeps 15 digits in the mantissa. Taking the square root gives about 10^7. That is, squaring and summing numbers larger than 10,000,000 already loses digits; see the postscript to my previous post :). Fortunately, that limit roughly matches the number of minute bars in real history, so if X is the bar number it should still work. But if X is time, shifting the origin of coordinates is simply unavoidable.
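The digit count above can be checked directly. A double has a 53-bit mantissa, so above 2^53 (roughly 9.0e15) not every integer can be represented. The sketch below compares the exact sum of X*X for the six timestamps used in this thread, computed in 64-bit integers, with the same sum accumulated in double. The class and method names are assumptions made for this illustration.

```java
// Sketch: how much of sum(X*X) survives in a double.
// SquarePrecision and its methods are hypothetical names.
class SquarePrecision {
    // X values (timestamps in seconds) taken from the thread.
    static final long[] X = {1216640160L, 1216640100L, 1216640040L,
                             1216639980L, 1216639920L, 1216639860L};

    // Exact sum of squares: about 8.9e18, which still fits in a signed 64-bit long.
    static long exactSumOfSquares() {
        long sum = 0L;
        for (long v : X) {
            sum += v * v;
        }
        return sum;
    }

    // The same sum accumulated in double, as the regression code does.
    static double doubleSumOfSquares() {
        double sum = 0.0;
        for (long v : X) {
            double d = (double) v;
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        long exact = exactSumOfSquares();
        double approx = doubleSumOfSquares();
        System.out.println("exact  = " + exact);
        System.out.println("double = " + (long) approx);
        System.out.println("lost   = " + (exact - (long) approx));
    }
}
```

At this magnitude neighbouring doubles are 1024 apart, so the low-order part of the sum cannot be stored at all. The regression formulas then subtract two numbers of this size, and the difference is dominated by exactly those missing digits.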


P.S. By the way, if you are going to use your function in the championship, add protection against division by zero. Otherwise there is a risk that your indicator will simply stop in the middle of the championship. Or at the beginning. Remember, that already happened with Fourier.
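One possible shape for such protection, as a hedged Java sketch: the function refuses to compute a slope when there are fewer than two points or when all X values are equal, and the caller decides what to do then. The name `GuardedRegression`, the `null` return convention, and the use of the mean-centered formula are assumptions for this example, not something prescribed in the thread.

```java
// Sketch: regression with a guard against a zero denominator.
// GuardedRegression and the null-return convention are hypothetical.
class GuardedRegression {
    // Returns {intercept, slope}, or null when the slope is undefined.
    static double[] fit(double[] x, double[] y) {
        int n = Math.min(x.length, y.length);
        if (n < 2) {
            return null;                       // not enough points
        }
        double minX = x[0], maxX = x[0], meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            minX = Math.min(minX, x[i]);
            maxX = Math.max(maxX, x[i]);
            meanX += x[i];
            meanY += y[i];
        }
        if (minX == maxX) {
            return null;                       // all X equal: denominator would be 0
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        if (!Double.isFinite(slope)) {
            return null;                       // overflow or NaN in the input
        }
        return new double[] {meanY - slope * meanX, slope};
    }

    public static void main(String[] args) {
        double[] bad = fit(new double[] {5, 5, 5}, new double[] {1, 2, 3});
        System.out.println(bad == null ? "no regression line" : "slope = " + bad[1]);
    }
}
```

Comparing the minimum and maximum of X is used here instead of testing the denominator against zero, because rounding can turn a denominator that should be zero into a small non-zero value.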

 

The same algorithm in Java

import java.util.ArrayList;

public class Prival {
    public static void main(String arg[]) {
        int N = 6;
        double Y[];
        double X[];
        ArrayList<Double> Parameters = new ArrayList<Double>();
        Parameters.add(0.0);
        Parameters.add(0.0);
        X = new double[6];
        Y = new double[6];
        for (int i = 0; i < N; i++) {
            // Y and X arrays for a sanity check
            // intercept = -3.33333333 slope = 5.00000000
            X[i] = i;
            Y[i] = i * i;
        }

        LinearRegr(X, Y, N, Parameters);
        System.out.println("intercept = " + Parameters.get(0) + " slope = " + Parameters.get(1));

        // second check
        X[0] = 1216640160;
        X[1] = 1216640100;
        X[2] = 1216640040;
        X[3] = 1216639980;
        X[4] = 1216639920;
        X[5] = 1216639860;

        Y[0] = 1.9971;
        Y[1] = 1.9970;
        Y[2] = 1.9967;
        Y[3] = 1.9969;
        Y[4] = 1.9968;
        Y[5] = 1.9968;

        LinearRegr(X, Y, N, Parameters);
        System.out.println("intercept = " + Parameters.get(0) + " slope = " + Parameters.get(1));
    }

    // y(x) = A + B*x; A is stored in Parameters[0], B in Parameters[1]
    public static void LinearRegr(double X[], double Y[], int N, ArrayList<Double> Parameters) {
        double sumY = 0.0, sumX = 0.0, sumXY = 0.0, sumX2 = 0.0;
        double A = 0, B = 0;
        for (int i = 0; i < N; i++) {
            sumY += Y[i];
            sumXY += X[i] * Y[i];
            sumX += X[i];
            sumX2 += X[i] * X[i];
        }
        B = (sumXY * N - sumX * sumY) / (sumX2 * N - sumX * sumX);
        A = (sumY - sumX * B) / N;
        Parameters.set(0, A);
        Parameters.set(1, B);
    }
}


Result:


intercept = -3.3333333333333335 slope = 5.0
intercept = -1102.169141076954 slope = 9.075536028198574E-7

Process finished with exit code 0

 

Rosh

I agree that these formulas give the same result. Here is Mathcad:

I see that the MQL and Java results coincide, but Mathcad has never failed me before, so I had doubts.

To check, I sorted the data by X and calculated the coefficients again.

The RESULT changed!!! That should not happen. Most likely the error accumulates from squaring large numbers (Candid is right). I looked through the literature and found a simpler formula, with no squaring of the raw X values and seemingly fewer calculations.

Its result is the same as in Mathcad, and it does not depend on sorting.

I recommend using this formula to calculate the linear regression coefficients.
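For reference, the formula used in the script can be written out as follows. The bar notation for the means is an assumption of this note; the code calls them mo_X and mo_Y, and the two sums var_0 and var_1.

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i

A = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad
B = \bar{y} - A\,\bar{x}, \qquad y(x) = A\,x + B

% Algebraically the same as the sum-based formula, because
N\sum_{i} x_i y_i - \sum_{i} x_i \sum_{i} y_i = N\sum_{i} (x_i - \bar{x})(y_i - \bar{y}),
\qquad
N\sum_{i} x_i^2 - \Bigl(\sum_{i} x_i\Bigr)^2 = N\sum_{i} (x_i - \bar{x})^2
```

The two forms are identical in exact arithmetic. They differ only in rounding: the left-hand sides subtract two huge, nearly equal numbers, while the right-hand sides subtract the mean first and then work with small values.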

//+------------------------------------------------------------------+
//|                                                       LinReg.mq4 |
//|                                                    Привалов С.В. |
//|                                             Skype -> privalov-sv |
//+------------------------------------------------------------------+
#property copyright "Привалов С.В."
#property link      "Skype -> privalov-sv"

//+------------------------------------------------------------------+
//| script program start function                                    |
//+------------------------------------------------------------------+
int start()
  {
//----
   int      N=6;                 // Array size
   double   Y[],X[],A=0.0, B=0.0;
   
  ArrayResize(X,N);
  ArrayResize(Y,N);
      
// check
    X[0]=1216640160;
    X[1]=1216640100;
    X[2]=1216640040;
    X[3]=1216639980;
    X[4]=1216639920;
    X[5]=1216639860;
    
    Y[0]=1.9971;
    Y[1]=1.9970;    
    Y[2]=1.9967;
    Y[3]=1.9969;    
    Y[4]=1.9968;    
    Y[5]=1.9968;
    
    
  LinearRegr(X, Y, N, A, B);
  
  Print("A = ", DoubleToStr(A,8)," B = ",DoubleToStr(B,8));
           
//----
   return(0);
  }
//+------------------------------------------------------------------+
//| Calculation of coefficients A and B in the equation              |
//| y(x)=A*x+B                                                       |
//| uses the formulas from https://forum.mql4.com/ru/10780/page5     |
//+------------------------------------------------------------------+

void LinearRegr(double X[], double Y[], int N, double& A, double& B)
{
      double mo_X = 0.0, mo_Y = 0.0, var_0 = 0.0, var_1 = 0.0;
      
    for ( int i = 0; i < N; i ++ )
      {
        mo_X +=X[i];
        mo_Y +=Y[i];
      }
    mo_X /=N;
    mo_Y /=N;
        
    for ( i = 0; i < N; i ++ )
      {
        var_0 +=(X[i]-mo_X)*(Y[i]-mo_Y);
        var_1 +=(X[i]-mo_X)*(X[i]-mo_X);
      }
        A = var_0 / var_1;
        B = mo_Y - A * mo_X;
}

The script is attached. If someone cleans up LinearRegr (to prevent errors on real data and to improve performance), that would be good. I swapped A and B because the notation y(x)=a*x+b is more familiar to me from books.

Files:
linreg_1.mq4  2 kb
 

I don't see how the result can depend on sorting. Sorting is not used explicitly anywhere in the formulas.


Besides, the latter algorithm uses the mean (expected) values of X and Y, which can potentially introduce an error into the calculations too. And another thing: two loops instead of one are unlikely to improve performance.


If linear regression has to be calculated in bulk over a number of price sequences, it is better to allocate separate buffers in an indicator and calculate with running cumulative sums. That speeds up the calculations by orders of magnitude. Example: Perry Kaufman AMA optimized
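One reading of the cumulative-sum idea, as a hedged Java sketch: store running totals of y and of i*y once, then obtain the regression over any window from differences of those totals, without looping over the window again. Here X is the bar index inside the window (0..n-1), so the sums of X and X*X have closed forms. The class `RollingRegression` and its layout are assumptions for this illustration and are not taken from the Kaufman AMA code mentioned above.

```java
// Sketch: sliding-window regression from prefix (cumulative) sums.
// RollingRegression is a hypothetical name; x is the bar index inside the window.
class RollingRegression {
    private final double[] prefY;   // prefY[k]  = y[0] + ... + y[k-1]
    private final double[] prefIY;  // prefIY[k] = 0*y[0] + 1*y[1] + ... + (k-1)*y[k-1]

    RollingRegression(double[] y) {
        prefY = new double[y.length + 1];
        prefIY = new double[y.length + 1];
        for (int i = 0; i < y.length; i++) {
            prefY[i + 1] = prefY[i] + y[i];
            prefIY[i + 1] = prefIY[i] + i * y[i];
        }
    }

    // Returns {intercept, slope} for the window y[start] .. y[start + n - 1],
    // with x = 0 at y[start] and x = n - 1 at the last element.
    double[] fit(int start, int n) {
        if (n < 2 || start < 0 || start + n > prefY.length - 1) {
            throw new IllegalArgumentException("window out of range or too short");
        }
        double sumY = prefY[start + n] - prefY[start];
        // sum of (i - start) * y[i] over the window
        double sumXY = (prefIY[start + n] - prefIY[start]) - start * sumY;
        double sumX = n * (n - 1) / 2.0;                          // 0 + 1 + ... + (n-1)
        double sumX2 = (n - 1.0) * n * (2.0 * n - 1.0) / 6.0;     // 0^2 + ... + (n-1)^2
        double slope = (n * sumXY - sumX * sumY) / (n * sumX2 - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;
        return new double[] {intercept, slope};
    }

    public static void main(String[] args) {
        double[] y = new double[10];
        for (int i = 0; i < y.length; i++) {
            y[i] = 2.0 * i + 1.0;             // a perfect line for the demo
        }
        double[] r = new RollingRegression(y).fit(3, 5);
        System.out.println("intercept = " + r[0] + " slope = " + r[1]);
    }
}
```

Each window then costs a constant number of operations. The running totals themselves grow with the length of the history, so over very long series they are subject to the same rounding concerns discussed in this thread.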

 
Rosh wrote >>

I don't see how the result can depend on sorting. Sorting is not used explicitly anywhere in the formulas.


Besides, the latter algorithm uses the mean (expected) values of X and Y, which can potentially introduce an error into the calculations too. And another thing: two loops instead of one are unlikely to improve performance.

If linear regression has to be calculated in bulk over a number of price sequences, it is better to allocate separate buffers in an indicator and calculate with running cumulative sums. That speeds up the calculations by orders of magnitude. Example: Perry Kaufman AMA optimized

1. That is exactly the point: the result should not depend on sorting, but with that algorithm it does. Check it out.

//+------------------------------------------------------------------+
//| script program start function                                    |
//+------------------------------------------------------------------+
int start()
  {
//----
   int      N=6;                 // Array size
   double   Y[],X[],Y1[],X1[],A=0.0, B=0.0;
   
  ArrayResize(X,N);
  ArrayResize(Y,N);
  ArrayResize(X1,N);
  ArrayResize(Y1,N);
      
// check
    X[0]=1216640160;
    X[1]=1216640100;
    X[2]=1216640040;
    X[3]=1216639980;
    X[4]=1216639920;
    X[5]=1216639860;
    
    Y[0]=1.9971;
    Y[1]=1.9970;    
    Y[2]=1.9967;
    Y[3]=1.9969;    
    Y[4]=1.9968;    
    Y[5]=1.9968;
    

// sort the array in ascending order of X (the original array was in descending order)
  for (int i = 0; i < N; i++)
   {
   X1[i]=X[N-i-1];
   Y1[i]=Y[N-i-1];
//   Print(X[i], " ", X1[i], " ", Y[i], " ", Y1[i]);
   }            
//----
// 
  LinearRegr(X, Y, N, A, B);
  Print("A = ", DoubleToStr(A,8)," B = ",DoubleToStr(B,8));
  LinearRegr(X1, Y1, N, A, B);
  Print(" A = ", DoubleToStr(A,8)," B = ",DoubleToStr(B,8));

  LinearRegr1(X, Y, N, A, B);
  Print("A = ", DoubleToStr(A,8)," B = ",DoubleToStr(B,8));
  LinearRegr1(X1, Y1, N, A, B);
  Print(" A = ", DoubleToStr(A,8)," B = ",DoubleToStr(B,8));

   return(0);
  }

//-------------------------------------------------------------------------------
// using this formula leads to errors if X=Time
// the formula was proposed here: https://forum.mql4.com/ru/10780/page4
//| y(x)=A+B*x  

void LinearRegr(double X[], double Y[], int N, double& A, double& B)
{
      double sumY = 0.0, sumX = 0.0, sumXY = 0.0, sumX2 = 0.0;
      
    for ( int i = 0; i < N; i ++ )
    {
        sumY   +=Y[i];
        sumXY  +=X[i]*Y[i];
        sumX   +=X[i];
        sumX2  +=X[i]*X[i];
    }
   B=(sumXY*N-sumX*sumY)/(sumX2*N-sumX*sumX);
   A=(sumY-sumX*B)/N;
}

//+------------------------------------------------------------------+
//| The formula I propose                                            |
//| Calculation of coefficients A and B in the equation              |
//| y(x)=A*x+B                                                       |
//| uses the formulas from https://forum.mql4.com/ru/10780/page5     |
//+------------------------------------------------------------------+

void LinearRegr1(double X[], double Y[], int N, double& A, double& B)
{
      double mo_X = 0.0, mo_Y = 0.0, var_0 = 0.0, var_1 = 0.0;
      
    for ( int i = 0; i < N; i ++ )
      {
        mo_X +=X[i];
        mo_Y +=Y[i];
      }
    mo_X /=N;
    mo_Y /=N;
        
    for ( i = 0; i < N; i ++ )
      {
        var_0 +=(X[i]-mo_X)*(Y[i]-mo_Y);
        var_1 +=(X[i]-mo_X)*(X[i]-mo_X);
      }
        A = var_0 / var_1;
        B = mo_Y - A * mo_X;
}

The result (the MetaTrader log shows the newest line at the top, so the two LinearRegr1 lines come first):

2008.07.30 13:51:08 LinReg EURUSD,M1: A = 0.00000090 B = -1098.77264952

2008.07.30 13:51:08 LinReg EURUSD,M1: A = 0.00000090 B = -1098.77264952

2008.07.30 13:51:08 LinReg EURUSD,M1: A = -1078.77267965 B = 0.00000089

2008.07.30 13:51:08 LinReg EURUSD,M1: A = -1102.16914108 B = 0.00000091

This should not be the case.
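The experiment above can be reproduced outside MetaTrader. The Java sketch below runs both formulas from this thread on the same six points in the original order and in reversed order. The class name `OrderDependence` and the helper `reversed` are assumptions for this illustration; the two regression methods mirror LinearRegr and LinearRegr1.

```java
// Sketch: the same data in two orders, fed to both formulas from the thread.
// OrderDependence and reversed are hypothetical names.
class OrderDependence {
    // Sum-based formula (LinearRegr): returns {intercept, slope}.
    static double[] bySums(double[] x, double[] y) {
        int n = x.length;
        double sumY = 0.0, sumX = 0.0, sumXY = 0.0, sumX2 = 0.0;
        for (int i = 0; i < n; i++) {
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX += x[i];
            sumX2 += x[i] * x[i];
        }
        double slope = (sumXY * n - sumX * sumY) / (sumX2 * n - sumX * sumX);
        return new double[] {(sumY - sumX * slope) / n, slope};
    }

    // Mean-centered formula (LinearRegr1): returns {intercept, slope}.
    static double[] centered(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new double[] {meanY - slope * meanX, slope};
    }

    static double[] reversed(double[] v) {
        double[] r = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[v.length - 1 - i];
        }
        return r;
    }

    public static void main(String[] args) {
        double[] x = {1216640160, 1216640100, 1216640040,
                      1216639980, 1216639920, 1216639860};
        double[] y = {1.9971, 1.9970, 1.9967, 1.9969, 1.9968, 1.9968};
        double[] xr = reversed(x), yr = reversed(y);
        System.out.println("sums, original order:     slope = " + bySums(x, y)[1]);
        System.out.println("sums, reversed order:     slope = " + bySums(xr, yr)[1]);
        System.out.println("centered, original order: slope = " + centered(x, y)[1]);
        System.out.println("centered, reversed order: slope = " + centered(xr, yr)[1]);
    }
}
```

Floating-point addition is not associative, so adding the same terms in a different order can round differently. With the sum-based formula that rounding difference sits in exactly the digits that survive the final subtraction.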

2. I can see that there are now two loops, which is why I asked for a speedup. The algorithm from 'Regression: what is it?' may be faster, but it needs optimizing too (I think Vinin has already done it).

3. Thanks for Kaufman, it's a good indicator. In case you've forgotten, before the second championship I was catching inaccuracies in it. Thank you for correcting them.

P.S. I would like to ask those who have Matlab: type in these arrays, calculate with the built-in functions (as far as I remember, there are some), and post the result here, so that we can come to a consensus. Thank you. Help )). Rosh is pretty hard to convince, but I'm a stubborn military man myself )))
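For anyone without Matlab at hand, an independent reference value can also be sketched with exact decimal arithmetic. The Java example below feeds the same arrays into the sum-based formula using `java.math.BigDecimal`, so the sums and the two differences are computed without rounding; only the final divisions are rounded, to 34 digits. The class `ExactRegression` and the use of strings for input are assumptions for this illustration.

```java
import java.math.BigDecimal;
import java.math.MathContext;

// Sketch: the sum-based formula evaluated in exact decimal arithmetic.
// ExactRegression is a hypothetical name.
class ExactRegression {
    // Returns {numerator, denominator, slope, intercept} for y = intercept + slope * x.
    static BigDecimal[] fit(String[] x, String[] y) {
        BigDecimal n = BigDecimal.valueOf(x.length);
        BigDecimal sumX = BigDecimal.ZERO, sumY = BigDecimal.ZERO;
        BigDecimal sumXY = BigDecimal.ZERO, sumX2 = BigDecimal.ZERO;
        for (int i = 0; i < x.length; i++) {
            BigDecimal xi = new BigDecimal(x[i]);   // parsed exactly from decimal text
            BigDecimal yi = new BigDecimal(y[i]);
            sumX = sumX.add(xi);
            sumY = sumY.add(yi);
            sumXY = sumXY.add(xi.multiply(yi));
            sumX2 = sumX2.add(xi.multiply(xi));
        }
        BigDecimal num = n.multiply(sumXY).subtract(sumX.multiply(sumY));
        BigDecimal den = n.multiply(sumX2).subtract(sumX.multiply(sumX));
        MathContext mc = MathContext.DECIMAL128;    // 34 significant digits
        BigDecimal slope = num.divide(den, mc);
        BigDecimal intercept = sumY.subtract(slope.multiply(sumX)).divide(n, mc);
        return new BigDecimal[] {num, den, slope, intercept};
    }

    public static void main(String[] args) {
        String[] x = {"1216640160", "1216640100", "1216640040",
                      "1216639980", "1216639920", "1216639860"};
        String[] y = {"1.9971", "1.9970", "1.9967", "1.9969", "1.9968", "1.9968"};
        BigDecimal[] r = fit(x, y);
        System.out.println("numerator   = " + r[0]);
        System.out.println("denominator = " + r[1]);
        System.out.println("slope       = " + r[2]);
        System.out.println("intercept   = " + r[3]);
    }
}
```

For these six points the exact numerator equals 0.342 and the exact denominator equals 378000, which gives a slope of about 9.0476e-7 and an intercept of about -1098.7726. This agrees with the output of the mean-centered formula.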

 
Prival wrote >>

2. I can see that there are now two loops, which is why I asked for a speedup. The algorithm from 'Regression: what is it?' may be faster, but it needs optimizing too (I think Vinin has already done it).

LWMA is indeed safer than X*X, so your work with Mathemat takes on a new meaning :). But I still consider my first recommendation (shifting the origin of coordinates) the best option. Is mechanically replacing Time[pos] with Time[pos]-Time[Bars-1] everywhere really such a risk of error?

 
Prival wrote >>

1. That is exactly the point: the result should not depend on sorting, but with that algorithm it does. Check it out.

Result

2008.07.30 13:51:08 LinReg EURUSD,M1: A = 0.00000090 B = -1098.77264952

2008.07.30 13:51:08 LinReg EURUSD,M1: A = 0.00000090 B = -1098.77264952

2008.07.30 13:51:08 LinReg EURUSD,M1: A = -1078.77267965 B = 0.00000089

2008.07.30 13:51:08 LinReg EURUSD,M1: A = -1102.16914108 B = 0.00000091

This should not be the case.



Insert debug printouts into your code:

//-------------------------------------------------------------------------------
// using this formula leads to errors if X=Time
// the formula was proposed here: https://forum.mql4.com/ru/10780/page4
//| y(x)=A+B*x  
 
void LinearRegr(double X[], double Y[], int N, double& A, double& B)
{
      double sumY = 0.0, sumX = 0.0, sumXY = 0.0, sumX2 = 0.0;
      
    for ( int i = 0; i < N; i ++ )
    {
        sumY   +=Y[i];
        sumXY  +=X[i]*Y[i];
        sumX   +=X[i];
        sumX2  +=X[i]*X[i];
    }
   double dN = N;   // dN was not declared in the posted snippet; assumed here to be N as a double
   Print("sumY = ",DoubleToStr(sumY,8)," sumX = ",DoubleToStr(sumX,8)," sumXY = ",DoubleToStr(sumXY,8)," sumX2 = ",DoubleToStr(sumX2,8));
   Print("sumXY*dN-sumX*sumY = ",DoubleToStr(sumXY*dN-sumX*sumY,8));    
   Print("sumX2*dN-sumX*sumX = ",DoubleToStr(sumX2*dN-sumX*sumX,8));    
 
   B=(sumXY*N-sumX*sumY)/(sumX2*N-sumX*sumX);
   A=(sumY-sumX*B)/N;
}

You get something like this:

first call
sumY = 11.98130000 sumX = 7299840060.00000000 sumXY = 14576928951.87000100 sumX2 = 8881277483596863500.00000000
sumXY*dN-sumX*sumY = 0.34199524
sumX2*dN-sumX*sumX = 376832.00000000
A = -1102.16914108 B = 0.00000091
second call
sumY = 11.98130000 sumX = 7299840060.00000000 sumXY = 14576928951.87000300 sumX2 = 8881277483596864500.00000000
sumXY*dN-sumX*sumY = 0.34202576
sumX2*dN-sumX*sumX = 385024.00000000
A = -1078.77267965 B = 0.00000089

This is yet another pitfall of computer arithmetic and rounding. On the one hand, I did not expect to run into it myself; on the other hand, the discrepancy is understandable when the two series (X and Y) differ so much in order of magnitude.