How switching to float in a high-performance program can boost the quality of video.

Serhii Shevchuk 2020.03.11 10:21 #21

Rorschach:

Thank you very much!

Are your calculations in double? Then the result is particularly impressive.

Good point. Indeed, in one place it was float instead of double:

D = fast_length(p);

To make it double, we need to correct it to

D = length(p);

Also, you can get rid of the last line (analogous to XRGB macro) by applying uchar4 instead of uint for the *buf argument. Try it that way:

__kernel void Func(int N, __global double *XP, __global double *YP, __global uchar *h, __global uchar4 *buf)
{
   size_t X = get_global_id(0);
   size_t Width = get_global_size(0);
   size_t Y = get_global_id(1);
   
   float2 p;
   double D=0,S1=0,S2=0;
   
   for(int w=0;w<N;w++){ 
      p.x = XP[w]-X;
      p.y = YP[w]-Y;
      D = length(p);
      S2+=D;
      if(w<N/2)
         S1+=D;
   }   
   //
   double d=S1/S2;
   buf[Y*Width+X] = (uchar4)(0xFF,h[(int)(d*11520)],h[(int)(d*17920)],h[(int)(d*6400)]);
}

Rorschach 2020.03.11 11:08 #22

Serhii Shevchuk:

It's really impressive! Fps is 1.5 times higher than on shaders.

Thanks, the code works and there is no performance degradation. I only moved the alpha channel to the end in the last line:

buf[Y*Width+X] = (uchar4)(h[(int)(d*11520)],h[(int)(d*17920)],h[(int)(d*6400)],0xFF);
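Why the alpha byte belongs at the end can be checked on the host. This is a minimal sketch in plain C (not OpenCL), assuming a little-endian machine and assuming XRGB packs an opaque pixel as 0xFFRRGGBB; `xrgb`, `bytes_to_u32`, `is_little_endian` and `check_layout` are hypothetical helpers for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the XRGB macro: opaque pixel packed as 0xFFRRGGBB. */
static uint32_t xrgb(uint8_t r, uint8_t g, uint8_t b)
{
    return 0xFF000000u | ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Reinterpret four consecutive bytes (the x, y, z, w of a uchar4) as one uint. */
static uint32_t bytes_to_u32(uint8_t x, uint8_t y, uint8_t z, uint8_t w)
{
    const uint8_t bytes[4] = { x, y, z, w };
    uint32_t value;
    memcpy(&value, bytes, sizeof value);
    return value;
}

/* Returns 1 on a little-endian machine, 0 otherwise. */
static int is_little_endian(void)
{
    const uint32_t probe = 1u;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* On little-endian, the byte order (b, g, r, a) matches the packed uint. */
static void check_layout(void)
{
    if (is_little_endian())
        assert(bytes_to_u32(0x33, 0x22, 0x11, 0xFF) == xrgb(0x11, 0x22, 0x33));
}
```

Under these assumptions the lowest-addressed byte of the uint is blue and the highest is alpha, so a uchar4 written as (b, g, r, 0xFF) produces the same pixel as the packed value, while putting 0xFF first would shift every channel by one position.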

Serhii Shevchuk 2020.03.11 11:40 #23

Rorschach:

It's really impressive! Fps is 1.5 times higher than on shaders.

Thanks, the code works and there is no performance degradation. I only moved the alpha channel to the end in the last line.

In this implementation, performance depends heavily on the number of centres (N). Ideally we would get rid of the loop in the kernel, but then S1 and S2 would have to be integers. We would need to look at the spread of their values; perhaps they can be scaled by some factor without losing too much precision. The loop could then be moved into a third dimension of the work space, which in theory should give a performance gain at high values of N.

I get 80-90 fps on my HD 7950 on a full-HD monitor at N=128.
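The integer idea can be tried on the host first. OpenCL's built-in atomic adds operate on integer types, which is presumably why S1 and S2 would have to become integers once several work-items accumulate into the same sum. The sketch below is plain C, not kernel code; `FIXED_SCALE`, `to_fixed`, `ratio_float` and `ratio_fixed` are hypothetical names, and the scale of 256 is an arbitrary assumption to be tuned against the real spread of distances.

```c
#include <assert.h>
#include <stdint.h>

#define FIXED_SCALE 256u  /* assumed scale: 8 fractional bits */

/* Convert a non-negative distance to fixed point, rounding to nearest. */
static uint32_t to_fixed(float d)
{
    assert(d >= 0.0f);
    return (uint32_t)(d * (float)FIXED_SCALE + 0.5f);
}

/* Reference: S1/S2 with float sums, as in the kernel loop. */
static float ratio_float(const float *dist, int n)
{
    float s1 = 0.0f, s2 = 0.0f;
    for (int w = 0; w < n; w++) {
        s2 += dist[w];
        if (w < n / 2)
            s1 += dist[w];
    }
    return s1 / s2;
}

/* Same ratio with integer sums; each += could become an atomic add. */
static float ratio_fixed(const float *dist, int n)
{
    uint32_t s1 = 0, s2 = 0;
    for (int w = 0; w < n; w++) {
        uint32_t d = to_fixed(dist[w]);
        s2 += d;
        if (w < n / 2)
            s1 += d;
    }
    return (float)s1 / (float)s2;
}
```

Overflow sets the limit on the scale. On a full-HD frame the largest possible distance is about 2203 pixels, which becomes roughly 564,000 after scaling by 256, so a 32-bit sum holds about 7,600 such terms before it wraps.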


Rorschach 2020.03.11 12:31 #24

Serhii Shevchuk:

In this implementation, performance depends heavily on the number of centres (N). Ideally we would get rid of the loop in the kernel, but then S1 and S2 would have to be integers. We would need to look at the spread of their values; perhaps they can be scaled by some factor without losing too much precision. The loop could then be moved into a third dimension of the work space, which in theory should give a performance gain at high values of N.

I get 80-90 fps on my HD 7950 on a full-HD monitor at N=128.

My 750 Ti gets 11 fps. I found the specs for the cards:

GPU	FP32 GFLOPS	FP64 GFLOPS	Ratio
Radeon HD 7950	2867	717	FP64 = 1/4 FP32
GeForce GTX 750 Ti	1388	43	FP64 = 1/32 FP32

Logically, switching to float could raise fps by up to a factor of 4 on AMD and up to 32 on NVIDIA.
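The arithmetic behind that estimate, as a small C sketch. The GFLOPS figures are the ones from the table above; `peak_speedup` is a hypothetical helper. The ratio is only an upper bound: it holds if the kernel is limited purely by arithmetic throughput rather than by memory access or other overhead.

```c
#include <assert.h>

/* Upper bound on the float-vs-double speedup from peak throughput figures. */
static double peak_speedup(double fp32_gflops, double fp64_gflops)
{
    assert(fp64_gflops > 0.0);
    return fp32_gflops / fp64_gflops;
}
```

With the table values, `peak_speedup(2867.0, 717.0)` comes out at about 4.0 and `peak_speedup(1388.0, 43.0)` at about 32.3.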


Rorschach 2020.03.12 00:00 #25

I made a copy of the OCL version. I'm shocked at the result. With 1600 centres I couldn't load the video card beyond 85%.

The optimisation with pre-calculated h has no effect, but I left it in. All calculations are in float; I don't see the point of using double, since all the functions return float anyway.

Files:

pixel.zip 1 kb

Swirl2_OCL_mod.mq5 14 kb


Rorschach 2020.03.13 13:31 #26

Some conclusions.

1) Pre-calculation of h had no effect.

2) I didn't notice any difference when getting rid of the if:

S1+=D*(i<0.5f*N);
//if(i<N*0.5f) S1+=D;

3) This is slower:

for(int i=0;i<N;i++)
  {XP[i]= (1.f-sin(j/iArr[2*i  ].w))*wh.x*0.5f;
   YP[i]= (1.f-cos(j/iArr[2*i+1].w))*wh.y*0.5f;
  }
float S1=0.f,S2=0.f;
for(int i=0;i<N;i++)
  {float2 p;
   p.x=XP[i]-Pos.x;
   p.y=YP[i]-Pos.y;
   ...
  }

than this:

float S1=0.f,S2=0.f;
for(int i=0;i<N;i++)
  {float2 p;
   p.x=(1.f-sin(j/iArr[2*i  ].w))*wh.x*0.5f-Pos.x;
   p.y=(1.f-cos(j/iArr[2*i+1].w))*wh.y*0.5f-Pos.y;
   ...
  }

or so it seems...

4) It doesn't offload the CPU, so scalper depth-of-market panels and the like are out of the question.
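The equivalence in point 2 can be checked on the host. A sketch in plain C with hypothetical function names: for scalars, a comparison in C (and in OpenCL C) evaluates to 1 or 0, so multiplying by it adds either D or nothing. Whether one form is faster depends on the compiler, which may well emit the same predicated code for both, and that would be consistent with seeing no difference.

```c
#include <assert.h>

/* First-half sum with an explicit branch. */
static float sum_first_half_branch(const float *d, int n)
{
    assert(n >= 0);
    float s1 = 0.0f;
    for (int i = 0; i < n; i++)
        if (i < n * 0.5f)
            s1 += d[i];
    return s1;
}

/* Same sum with the comparison used as a 0/1 multiplier. */
static float sum_first_half_branchless(const float *d, int n)
{
    assert(n >= 0);
    float s1 = 0.0f;
    for (int i = 0; i < n; i++)
        s1 += d[i] * (float)(i < 0.5f * n);
    return s1;
}
```

The two forms agree only for finite inputs: an infinite or NaN distance multiplied by 0 yields NaN, whereas the branch would simply skip it.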


Реter Konow 2020.03.13 13:59 #27

Rorschach:

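Some conclusions.

1) Pre-calculation of h had no effect.

2) I didn't notice any difference when getting rid of the if.

3) Pre-calculating XP and YP in a separate loop seems slower than computing them inline.

4) It doesn't offload the CPU, so scalper depth-of-market panels and the like are out of the question.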

In MT5, a scalper's depth of market can easily run in a single thread without OCL and without overloading the CPU. Of course, that doesn't mean OCL isn't needed. Just an FYI.

If you use the CCanvas class to build the depth of market, the workload is proportional to its area. That is, the bigger the depth-of-market window, the heavier the load, because the entire canvas is redrawn rather than only the parts that changed. However, if the cells are turned into independent elements, they can be updated independently of the rest of the canvas, which cuts the redraw time and the resulting CPU load several times over.


Rorschach 2020.03.13 17:06 #28

Реter Konow:

In MT5, a scalper's depth of market can easily run in a single thread without OCL and without overloading the CPU. Of course, that doesn't mean OCL isn't needed. Just an FYI.

If you use the CCanvas class to build the depth of market, the workload is proportional to its area. That is, the bigger the depth-of-market window, the heavier the load, because the entire canvas is redrawn rather than only the parts that changed. However, if the cells are turned into independent elements, they can be updated independently of the rest of the canvas, which cuts the redraw time and the resulting CPU load several times over.

Point 4 is in question. In the Remnant3D example, the CPU is barely loaded.

Реter Konow 2020.03.13 17:38 #29

Rorschach:

Point 4 is in question. The Remnant3D example hardly loads the CPU.

This has been tested. With normal canvas dynamics, the CPU in MT5 is barely loaded if you redraw only the individual cells whose values have changed.

Conversely, if every incoming value triggers a redraw of the entire canvas, the processor will be heavily loaded.

The difference is in the number of values rewritten in the pixel array. Update individual areas selectively and you won't have load problems.
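The difference in pixel writes can be made concrete with a small model. This is a plain C sketch, not CCanvas code; the grid dimensions, the cell size and both function names are hypothetical. It counts how many pixels are rewritten when only the changed cells are redrawn.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: a 4 x 25 grid of 60 x 16 pixel cells. */
enum { COLS = 4, ROWS = 25, CELL_W = 60, CELL_H = 16 };
enum { CANVAS_W = COLS * CELL_W, CANVAS_H = ROWS * CELL_H };

static uint32_t pixels[CANVAS_W * CANVAS_H];

/* Fill one cell's rectangle; returns the number of pixels written. */
static int redraw_cell(int col, int row, uint32_t color)
{
    assert(col >= 0 && col < COLS && row >= 0 && row < ROWS);
    int written = 0;
    for (int y = row * CELL_H; y < (row + 1) * CELL_H; y++)
        for (int x = col * CELL_W; x < (col + 1) * CELL_W; x++) {
            pixels[y * CANVAS_W + x] = color;
            written++;
        }
    return written;
}

/* Redraw only the cells whose value changed; returns pixels written. */
static int redraw_changed(const int *old_vals, const int *new_vals, uint32_t color)
{
    int written = 0;
    for (int row = 0; row < ROWS; row++)
        for (int col = 0; col < COLS; col++) {
            int i = row * COLS + col;
            if (old_vals[i] != new_vals[i])
                written += redraw_cell(col, row, color);
        }
    return written;
}
```

In this model, two changed cells out of 100 cost 1,920 pixel writes, against 96,000 for a full redraw of the same canvas.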


Rorschach 2020.03.13 18:12 #30

Реter Konow:

This has been tested. With normal canvas dynamics, the CPU in MT5 is barely loaded if you redraw only the individual cells whose values have changed.

Conversely, if every incoming value triggers a redraw of the entire canvas, the processor will be heavily loaded.

The difference is in the number of values rewritten in the pixel array. Update individual areas selectively and you won't have load problems.

That's the thing about Remnant3D: it's a full-screen canvas and doesn't load the CPU.

DirectX - page 3