OpenCL: real challenges - page 6

 
Mathemat: I haven't checked it in the tester yet.

Well, then why did you post unverified nonsense?

After all, maybe I'm the only one who has ever even tried OpenCL in the tester...

 
Roffild:

Well, then why did you post unverified nonsense?

After all, maybe I'm the only one who has ever even tried OpenCL in the tester...

It's not nonsense, it's a request that has since been fulfilled.

Once again, write to servicedesk and justify what you want. If you can't justify it, that's your problem.

At the time those articles were written, talk of using OpenCL in the tester had not yet begun in earnest. The request dates from exactly that time.

It is you who should turn your brain on because 0.35300 ms refers to clEnqueue[Read/Write]Buffer() and not to global memory access inside the kernel.
This function is not present in the OpenCL for MQL5 implementation. What are you talking about?
 
Roffild:

Go through my posts again.

The main criterion: the execution time of the "OpenCL-style" MQL code for one tick must exceed time = Number_Buffers * 0.35300 ms

To measure the speed of an algorithm in MQL to microsecond accuracy (1000 microseconds = 1 millisecond), you have to run it in the tester several times and compute Total_Time / Number_of_Ticks (see my post above).

Were it not for the buffer delay, my code would pass the test in ~30 seconds - that's ~2 times faster than "OpenCL style" MQL (55 seconds) and ~11 times faster than regular code (320 seconds).

What other criteria are there?

I'm not asking you to re-read all my forum posts on OpenCL. :) But all the basic criteria are described and discussed there.

The main criterion is the t_alg/t_mem ratio, where t_alg is the algorithmically optimized computation time of the kernel algorithm and t_mem is the access time of remote (*) memory. The larger this ratio, the better the prospects of a speed-up via OpenCL. In other words, the heavier the calculations and the fewer the array transfers to and from the device, the better the prospects.

(*) remote memory = all kinds of "non-register" memory (register memory is very fast), e.g. (1) global device memory, which is much slower than register memory but much faster than (2) memory external to the device (CPU RAM).

OpenCL: From Naive Coding to More Insightful Coding
  • 2012.06.05
  • Sceptic Philozoff
  • www.mql5.com
This article demonstrates some of the optimization opportunities that open up when the characteristics of the hardware the kernel runs on are taken into account, at least superficially. The figures obtained are far from the limit, but even they show that, given the capabilities available here and now (the OpenCL API as implemented by the terminal developers does not allow control of some parameters important for optimization, in particular the local work-group size), the performance gain over host-program execution is very substantial.
 
Mathemat:

This is not nonsense, but a fulfilled request.

Once again, write to servicedesk and justify what you want. If you can't justify it, that's your problem.

Once again: bug #865549 from 2013.10.17 23:17. The developers have been notified but are unlikely to fix anything. This limitation was probably added by one of them on purpose, so that optimization does not tie up the whole processor.

But the articles don't say a word about it!

Mathemat:
It is time to turn your brain on because 0.35300 ms refers to clEnqueue[Read/Write]Buffer() and not to global memory access inside the kernel.

This command is not present in the OpenCL for MQL5 implementation. What are you talking about?

Eh... And you write articles on OpenCL, too...

Just so you know, clEnqueueReadBuffer = CLBufferRead and clEnqueueWriteBuffer = CLBufferWrite are called synchronously.

The learning starts here

MetaDriver: The main criterion is the ratio t_alg/t_mem, where t_alg is the algorithmically optimized computation time of the kernel algorithm and t_mem is the time to access remote (*) memory. In other words, the "heavier" the calculations and the fewer the array transfers to and from the device, the greater the speed-up potential with OpenCL.
This is a criterion for optimizing execution only. Before my posts there were no approximate figures for the transfer rate of the buffers themselves.
 

Folks, before you argue any further, think about the three posts starting here and specifically about

mql5: In this particular example, the advantage of using OpenCL is eaten up by buffer copying overhead.


Because you are focused on optimizing the kernel itself while my posts are about buffers.

 
Roffild: Just so you know: clEnqueueReadBuffer = CLBufferRead and clEnqueueWriteBuffer = CLBufferWrite and they are called synchronously.

I've known for a long time that the OpenCL for MQL5 implementation is just a wrapper over the real API. By the way, I wrote in my second article that the ability to set the work-group size was missing. I made a request to servicedesk, and they implemented it after a while.

I also know that CLBuffer[Read/Write] is similar to clEnqueue[Read/Write]Buffer, but these functions are not identical at all: they have different syntaxes.

But I don't understand why you keep talking about clEnqueueXXX functions that are not present in OpenCL for MQL5.

I'll try to understand what you want.

Roffild:

It's time to turn your brain on because 0.35300 ms refers to clEnqueue[Read/Write]Buffer() and not to global memory access inside the kernel.

The second can be solved by optimizing the kernel itself, while the first is a hardware limitation.

Okay. Who do you have a complaint against? The graphics card manufacturer?
 
Mathemat: I also know that CLBuffer[Read/Write] is analogous to clEnqueue[Read/Write]Buffer but these functions are not identical at all: they have different syntaxes.

But I don't understand why you keep talking about clEnqueueXXX functions that are not present in OpenCL for MQL5.

There's no difference there. There's probably a wrapper like this:

// Assumed globals from the wrapper's surrounding context:
cl_command_queue enqueue;  // command queue of the current CL context
cl_int errcode;            // last error code

template<typename T>
cl_int BufferWrite(cl_mem buffer, const T *ptr)
{
        size_t bufsize;
        errcode = clGetMemObjectInfo(buffer, CL_MEM_SIZE, sizeof(bufsize), &bufsize, NULL);
        // CL_TRUE => blocking call: returns only after the transfer completes
        return (errcode = clEnqueueWriteBuffer(enqueue, buffer, CL_TRUE, 0, bufsize, ptr, 0, NULL, NULL));
}
template<typename T>
cl_int BufferRead(cl_mem buffer, T *ptr)
{
        size_t bufsize;
        errcode = clGetMemObjectInfo(buffer, CL_MEM_SIZE, sizeof(bufsize), &bufsize, NULL);
        return (errcode = clEnqueueReadBuffer(enqueue, buffer, CL_TRUE, 0, bufsize, ptr, 0, NULL, NULL));
}

So there's no need to dwell on the syntax either. The fact that the 3rd argument = CL_TRUE has already been confirmed.

Mathemat:

The second can be solved by optimizing the kernel itself, while the first is a hardware limitation and no amount of brainpower will help.
Ok. Who do you have a complaint against? The video card manufacturer?

The complaint is against the authors of the articles: there is no practical data about this most important limitation! (There wasn't, until I tested it.)

 
Roffild:

The complaint is directed at the article writers, that there's no practical data on this most important limitation! (There wasn't, until I tested it.)

Don't read any more articles and you won't be complaining. ;)

--

Why are you picking on me? How can you cite unknown data in an article? "The latency of data transfer to/from the device is huge and must be taken into account"? The specific figures depend on the specific hardware. So you tested it yourself - well done. Sometimes people (myself included) post test code to estimate the capabilities and limitations of different hardware. They ask others to share their results, and people often do (kudos to them for that); that way everyone can see the statistics and which combinations work. Then somebody buys new hardware or changes their approach to writing code with those results in mind. What do you want? Well, write a complaint to Sportloto, maybe that will make your code run faster...

:)

 

Actually, I had already wrapped everything up at https://www.mql5.com/ru/forum/13715/page5#comment_646513, but the authors of the articles themselves wanted to prove something else.

Your articles lack this specific and very important information, so they are unfinished and describe unrealistic tasks.

You may not add info to the articles, but it's silly to pretend that these particular figures mean nothing.

OpenCL: Real Tasks
  • www.mql5.com
So what can OpenCL offer traders?
 

I don't understand the hidden meaning of the script/advisor you posted, Roffild. The code is, to put it mildly, incomprehensible.

- Where is the cl_khr_fp64 pragma? You need it when calculating with double in the kernel.
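For reference, enabling double-precision support in OpenCL C looks like this. A minimal sketch: the pragma line is standard OpenCL, while the kernel body and its names are purely hypothetical.

```c
// Must appear in the kernel source before any use of double.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// Hypothetical kernel just to illustrate placement of the pragma.
__kernel void scale(__global double *buf, const double k)
{
        size_t i = get_global_id(0);
        buf[i] *= k;
}
```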

- Why is this piece of code in the OnTick() function, when it could be computed once during initialization?

uint units = (uint)CLGetInfoInteger(hcontext, CL_DEVICE_MAX_COMPUTE_UNITS);
uint global_work_offset[] = {0};
uint global_work_size[1];
uint local_work_size[1];
global_work_size[0] = ArraySize(price);
local_work_size[0] = global_work_size[0] / units;

- Why is the global task size equal to just 240? It would have to be much larger to gain any benefit from OpenCL - at least a million times larger, judging by eye.

- Why should the global task be divided by the number of units to get the size of the local one? Both CPU and GPU allow a global task to be divided into far more subtasks. And units in your case is just the number of SIMD engines.

Say, the number of units in my video card is 28 (Radeon HD 7950). But 240 is not evenly divisible by that number, which means a significant portion of the calculations may end up non-parallel.

I have 1792 shaders; you have 1440. That is roughly the number of parts you'd want to divide the global task into to load the card properly. But you will have to calculate the correct global task size. (And it's better not to divide but to multiply.)

And what your card is computing all this time is not clear at all.

In short: what should your code do?