OpenCL: real challenges - page 4

 
TheXpert:

The hardest method is FeedPatterns.

At a glance there is no way to speed it up: the kernels do very little work relative to the amount of data (much data, little work), so any gain will be eaten up by copying back and forth.

You can try to explicitly ask the compiler in VS to parallelize it, or create parallel threads on the CPU yourself.

OpenCL: From Naive Towards More Insightful Programming
  • 2012.06.05
  • Sceptic Philozoff
  • www.mql5.com
This article demonstrates some of the optimization opportunities that open up when the hardware on which the kernel executes is taken into account, at least superficially. The figures obtained are far from the ceiling, but even they show that, given the capabilities available here and now (the OpenCL API as implemented by the terminal developers does not allow control over some parameters important for optimization, in particular the local work-group size), the performance gain over host-program execution is very substantial.
 
TheXpert:
Eventually, I'm going to port it to MQL.
Is that strictly necessary? I've been wanting to look into AMP for a long time, and now there's a chance...
 
kazakov.v:

At a glance there is no way to speed it up: the kernels do very little work relative to the amount of data (much data, little work), so any gain will be eaten up by copying back and forth.

You can try to explicitly ask the compiler in VS to parallelize it, or create parallel threads on the CPU yourself.

Why, these tasks are perfectly suited to OpenCL

void Solver::FeedPattern(const Pattern &pattern)
  {
   size_t size=pattern.Size();

   assert(size==m_PatternSize);
   if(size!=m_PatternSize)
     {
      return;
     }

   const std::vector<double>&values=pattern.Values();
   double etalon=pattern.Etalon();

   size_t i;

   for(i=0; i<size;++i)
     {
      for(size_t j=0; j<size;++j)
        {
         m_Matrix[i][j]+=values[i]*values[j];
        }
      m_Matrix[i][size]+=values[i];
      m_Matrix[i][size+1]+=values[i]*etalon;
     }

   for(i=0; i<size;++i)
     {
      m_Matrix[size][i]+=values[i];
     }
   m_Matrix[size][size]+=1;
   m_Matrix[size][size+1]+=etalon;
  }
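Since each row i of m_Matrix is updated independently of the others, the earlier suggestion to "create parallel threads for the processor" is easy to sketch. Below is a minimal free-standing C++ version (the function name, the plain vector-of-vectors matrix, and the thread-count parameter are my own assumptions, not the poster's code); the matrix is expected to be (size+1) x (size+2), matching the indices used above:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical threaded variant of the accumulation in FeedPattern:
// rows of the matrix are split across CPU threads; no locking is needed
// because no two threads touch the same row.
void FeedPatternParallel(std::vector<std::vector<double>> &matrix,
                         const std::vector<double> &values,
                         double etalon,
                         unsigned nThreads = std::thread::hardware_concurrency())
  {
   const std::size_t size = values.size();
   if(nThreads == 0) nThreads = 1;

   auto worker = [&](std::size_t first, std::size_t last)
     {
      for(std::size_t i = first; i < last; ++i)
        {
         for(std::size_t j = 0; j < size; ++j)
            matrix[i][j] += values[i] * values[j]; // pairwise products
         matrix[i][size]     += values[i];          // sum of inputs
         matrix[i][size + 1] += values[i] * etalon; // input * target
        }
     };

   std::vector<std::thread> threads;
   const std::size_t chunk = (size + nThreads - 1) / nThreads;
   for(std::size_t first = 0; first < size; first += chunk)
      threads.emplace_back(worker, first, std::min(first + chunk, size));
   for(auto &t : threads)
      t.join();

   // The last row is cheap (O(size)), so it stays serial.
   for(std::size_t i = 0; i < size; ++i)
      matrix[size][i] += values[i];
   matrix[size][size]     += 1.0;
   matrix[size][size + 1] += etalon;
  }
```

The same row-wise independence is what makes the method map naturally onto an OpenCL NDRange: one work-item per row, with the copy-in/copy-out cost being the open question discussed in this thread.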

And the green stuff can do it too.

 
Urain:

Why, these tasks are great on OpenCL

and you can do the green stuff too.

You need to implement it in OpenCL and in C++ and run a comparison test; if there is a speed-up, you can port everything.
 

Does CL_DEVICE_PREFERRED_VECTOR_WIDTH_* indicate maximum vector size or optimal size?

When CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE=2 will double3 and double4 already be slow?

 
Roffild:

Does CL_DEVICE_PREFERRED_VECTOR_WIDTH_* indicate maximum vector size or optimal size?

When CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE=2 will double3 and double4 already be slow?

1. The maximum.

2. The slowdown is unlikely to be severe, but it will not increase execution speed either.

 
MQL's OpenCL is just a wrapper over the original API.
I need answers and clarifications on how this wrapper is implemented.

CLContextCreate() = clCreateCommandQueue(clCreateContextFromType(CL_DEVICE_TYPE_X)) ?
Is this actually a queue for one device, not a context?

Are the buffers read/write synchronous or asynchronous?
clEnqueue[Read/Write]Buffer(enqueue, buffer, CL_TRUE) - here CL_TRUE or CL_FALSE ?

bool CLExecute(int kernel) = clEnqueueTask();
bool CLExecute(int kernel, uint work_dim, works...) = clEnqueueNDRangeKernel();
clEnqueueNativeKernel() - not implemented.

Does CLExecute() return control immediately, without blocking for the duration of execution?
It seems to take 2-40 ms just to place it into the queue...

Now here is the main question:
When and under what conditions is clFinish() called? Without clFinish() it is difficult to build a queue.

And the MQL help does not describe the CL_MEM_*_HOST_PTR flags at all, even though they are present there.

I have finally completely converted my indicator to OpenCL style.
Running the test from 2013.01.09 to 2013.10.10 on M5 with "OHLC on M1":
320 seconds - before translation
55 seconds - OpenCL style emulation on MQL5:
// the data preparation is shared, plus a copy of the kernel in MQL5; emulation is done via:
for (int get_global_id = maxcount-1; get_global_id>-1; get_global_id--) NoCL(params,m_result,get_global_id);
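The emulation trick above boils down to writing the kernel body as an ordinary function of a global id and looping over the whole work range on the host. A minimal C++ reconstruction of the shape (the Params struct, the kernel body, and the names other than NoCL/get_global_id/maxcount are my own placeholders, not the poster's actual indicator code):

```cpp
#include <vector>

// Hypothetical shared input data, prepared once on the host.
struct Params
  {
   std::vector<double> open;
  };

// The "kernel": one work-item computes one result element.
// Toy body for illustration: square the bar's open price.
void NoCL(const Params &params, std::vector<double> &result, int get_global_id)
  {
   result[get_global_id] = params.open[get_global_id] * params.open[get_global_id];
  }

// The host-side emulation of CLExecute: iterate the NDRange on the CPU,
// newest global id first, exactly as in the forum snippet.
void RunEmulated(const Params &params, std::vector<double> &result)
  {
   const int maxcount = (int)result.size();
   for(int get_global_id = maxcount - 1; get_global_id > -1; get_global_id--)
      NoCL(params, result, get_global_id);
  }
```

Structuring the code this way means the same body can later be pasted into an OpenCL kernel almost verbatim, with the loop replaced by the NDRange dispatch.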
But the GPU run was frustrating :(
I had hoped to run the test in under 30 seconds, but instead I got a total lag from CLBufferWrite!

Video card load at 32%, test passes in 1710 seconds without CL_MEM_*_HOST_PTR
Video card load at 22%, test passes in 740 seconds with CL_MEM_ALLOC_HOST_PTR
CL_MEM_COPY_HOST_PTR and CL_MEM_USE_HOST_PTR result in CLExecute: 5109 (ERR_OPENCL_EXECUTE)

So how to properly exchange data?

And still no CPU can be selected for calculations in the tester.

Videoadapter = ATI Radeon HD 5850
Processor = AMD Phenom(tm) II X4 925 Processor
 
Roffild:
CLContextCreate() = clCreateCommandQueue(clCreateContextFromType(CL_DEVICE_TYPE_X)) ?
Is this actually a queue for one device, not a context?
Yes, a context and a queue are created per device (our research showed that OpenCL does not work correctly across several different devices).
Are the buffers read/write synchronous or asynchronous?
clEnqueue[Read/Write]Buffer(enqueue, buffer, CL_TRUE) - here CL_TRUE or CL_FALSE ?
Read and write are synchronous.
bool CLExecute(int kernel) = clEnqueueTask();
bool CLExecute(int kernel, uint work_dim, works...) = clEnqueueNDRangeKernel();
clEnqueueNativeKernel() - not implemented.
Does CLExecute() return control immediately? It doesn't block it for the time of its execution?
Yes
And now the main question:
When and under what conditions is clFinish() called? Without clFinish() it is difficult to form a queue.
It is not called; a blocking read from the buffer must be used instead.
And there are no descriptions of CL_MEM_*_HOST_PTR in MQL Help at all.

I have finally completely converted my indicator to OpenCL style.
Running the test from 2013.01.09 to 2013.10.10 on M5 with "OHLC on M1":
320 seconds - before translation
55 seconds - OpenCL style emulation on MQL5:
But the GPU run was frustrating :(
I had hoped to run the test in under 30 seconds, but instead I got a total lag from CLBufferWrite!

Video card load at 32%, test passes in 1710 seconds without CL_MEM_*_HOST_PTR
Video card load at 22%, test passes in 740 seconds with CL_MEM_ALLOC_HOST_PTR
CL_MEM_COPY_HOST_PTR and CL_MEM_USE_HOST_PTR result in CLExecute: 5109 (ERR_OPENCL_EXECUTE)

So how to properly exchange data?
The CL_MEM_COPY_HOST_PTR and CL_MEM_USE_HOST_PTR flags are not currently supported by the terminal (we are investigating this issue).
And still no CPU is selected for calculations in the tester.
Have you tried explicitly specifying CPU device?
 

How about trying to give us asynchronous buffers and clFinish() ?

My assumption is that the synchronous write is what slows things down, which even AMD CodeXL hints at:

"clEnqueueWriteBuffer: Unnecessary synchronization. Blocking write"

And in the tester the CPU cannot even be selected by number. Bug #865549.

 
Eh... The articles about speed gains from OpenCL on the GPU turned out to be a fairy tale, because they never dealt with real tasks :(

This month, I wrote thousands of lines of code to conquer OpenCL.

So, to debug OpenCL, I had to emulate the MQL functions in C/C++ in order to run the code through AMD CodeXL.

I will repeat the test results from 2013.01.09 to 2013.10.10 on M5 with "OHLC on M1":
320 seconds - before translation
55 seconds - OpenCL style emulation on MQL5

"OpenCL style" means reducing the number of CopyHigh/CopyTime/CopyOpen/... calls to a minimum, at the cost of more code to process the arrays after those calls.

And these are exactly the measurements that the beautiful articles on OpenCL lack:

Test result without OpenCL:
Core 1 EURUSD,M5: 1108637 ticks (55953 bars) generated within 55427 ms (total bars in history 131439, total time 55520 ms)
55427 ms / 1108637 ticks = 0.04999 ms/tick - processing one tick on the CPU (an OpenCL run must not exceed this time)

This is what I have got by running my own code in C/C++ and running it through AMD CodeXL:
0.02000 ms - 0.05000 ms - execution of my kernel on GPU
0.35300 ms - one clEnqueueWriteBuffer call for 168 bytes, i.e. about 500 KB/s
0.35300 ms - one clEnqueueWriteBuffer call for 3.445 KB, i.e. about 9.5 MB/s (the average time per transfer is the same)

168 Bytes is:
double open[21]={1.3668,1.3661,1.36628,1.3664,1.36638,1.36629,1.3664,1.36881,1.36814,1.3692,1.36918,1.36976,1.36816,1.36776,1.36779,1.3695,1.36927,1.36915,1.3679,1.36786,1.36838};

And the 3.445 KB figure came from an array-size calculation error (21*168), but even that did not affect the transfer time.

To sum up: even if I manage to optimize my kernel down to 0.02000 ms, which is indeed ~2.5 times faster than the regular MQL pass (0.04999 ms), everything still comes down to GPU read/write speed (0.35300 ms - ~7 times slower than the MQL computation!).

In my tester no CPU can be selected for OpenCL, so I cannot use the other 3 idle cores...

P.S.
55 seconds is not yet the limit of optimization in MQL; it is just an OpenCL emulation for where there is no support :)
MQL5 Documentation: Timeseries and Indicators Access / CopyHigh
  • www.mql5.com