OpenCL: real challenges - page 7

 

1) Pragmas are a requirement for compile-time support, not an activation of the support itself (as you seem to think). So cl_khr_fp64 is already involved if your hardware supports it.

2) And if the array size changes at runtime? Of course, it can be done in this particular code, but it won't make the situation any better.

Let me tell you right away that I was profiling in AMD CodeXL:

3) If we take only the computing time of the kernel itself, any well-parallelized task gains on the GPU simply by using more cores than the CPU has. So even 8 tasks are enough to get a speedup.

4) I myself have a lot of questions about the formula for calculating Local. The biggest gain came when, with work_dim=1, I spread the tasks over all cores of the video card, and that is UNITS.

And why do you divide the buffer size at all, when you should divide the number of its elements? Which is what I actually did.

Mathemat: In short: what does your code have to do?

To show that the stage of preparing for calculations is not instantaneous and that buffer transfer takes a lot of time, which calls into question the practicability of using OpenCL even for hyped-up tasks.

It also shows that the CPU is not selected in the tester.

 
Roffild:

Showing that the stage of preparation for calculations is not instantaneous and buffer transfer takes plenty of time, which calls into question the practicability of using OpenCL even for hyped up tasks.

That is, it's quite silly to yell about it; to measure it, on the other hand, is quite another matter; it may be practically useful.

It also shows that CPU is not selected in the tester.

Maybe it is justified, or maybe it is excessive caution. In any case, I'm sure it was done deliberately to ensure the efficiency of the testing process itself, or rather of optimization (since it is multi-threaded). A chance of getting CPU selection enabled may appear if the notions of testing and optimization are clearly and completely separated (at the level of official policy), i.e. defined as different logical types of tester use, with corresponding (officially different) software support. (This would be good in many ways, and I have long been a supporter of such a separation/distinction, right down to different buttons for starting optimization and testing.)

Theoretically, CPU selection could then be allowed during testing and disallowed during optimization (this is correct).

 
Roffild: 1) Pragmas are a compile-time support requirement, not an activation of the support itself (as you seem to think). That is, cl_khr_fp64 is already involved if the hardware supports it.

Yeah, I overdid it with the pragma. If you keep working on your own video card and don't pass the code on to anyone else, no problem. But if somebody tries to run it on a Barts card (say, a 6870), there will be problems: the kernel code will try to execute without showing any errors.

4) I have a lot of questions myself about the formula for calculating Local. The biggest gain was when, with work_dim=1, tasks were spread over all cores of the video card, and that is UNITS.

Not necessarily. It is often much more useful to increase the amount of work inside the kernel itself, in order to amortize the overhead associated with data transfer.

And your UNITS is just the number of SIMD engines. According to the documentation,

local_work_size[] sets the subset of tasks to be executed by the specified OpenCL program kernel. Its dimension is equal to that of work_items[] and it allows the total set of tasks to be cut into smaller subsets without a division remainder. In fact, the size of the array local_work_size[] must be chosen so that the global task set work_items[] is sliced into smaller subsets without remainder. In this example, local_work_size[3]={10, 10, 10} will do, since work_items[40, 100, 320] can be assembled from the array local_items[10, 10, 10] without any remainder.

The number of SIMD engines is a strictly hardware constant, which does not have to divide the global task at all.

But first you need to properly evaluate the global problem itself.

About the CPU in the tester - I see, I'm convinced.

 
MetaDriver:

Well, this is not news at all. I mean, it's silly to scream about it, but measuring it is another matter entirely; it can be practically useful.

Except that for some reason I had to take these measurements myself... When you read "there is a transfer delay", you have no idea how big it is.

Mathemat: And your UNITS is just a number of SIMD engines. According to the documentation,

The number of SIMD engines is a strictly hardware constant, which doesn't have to divide the global task at all.

Let's better use official documentation:

CL_DEVICE_MAX_COMPUTE_UNITS cl_uint The number of parallel compute units on the OpenCL device. A work-group executes on a single compute unit. The minimum value is 1.
local_work_size
Points to an array of work_dim unsigned values that describe the number of work-items that make up a work-group (also referred to as the size of the work-group) which will execute the kernel specified by kernel.
So my conclusions are correct and confirmed by an AMD CodeXL run.
 

The point is different. Call your units barrels if you like, but the fact remains that units in your code does not divide the global task evenly (mine certainly does not: 240/28 is not an integer; yours doesn't either, since you have units=18). This is a bug.

And second, here and now you are working with OpenCL for MQL5 (well, that's not quite right, but you understood me); that, after all, is a different OpenCL than the one from Khronos.

P.S. I didn't create the hyperlink; it appeared by itself :)

Roffild:
CL_DEVICE_MAX_COMPUTE_UNITS cl_uint The number of parallel compute units on the OpenCL device. A work-group executes on a single compute unit. The minimum value is 1.

See other sources for definition of "compute units".

By the way, here is a table from my second article. It would be nice if you understood all these compute units (18), stream cores (288), processing elements (1440), max wavefronts/GPU (496) and work-items/GPU (31744). I haven't figured it out yet.


 
Mathemat:

The point is different. Call your units barrels, but the fact remains that units in your code does not divide the global task evenly (mine certainly does not: 240/28 is not an integer; yours doesn't either, since you have units=18). This is a glitch.

So why did you base the calculation on 240 bytes? You may be able to count in bytes, but the graphics card cannot. So 240/8 = 30 doubles.

240 bytes is the size of the entire buffer of 30 doubles.

And "pick a divisor without remainder" is only a recommendation from the official documentation. And that recommendation doesn't work perfectly.

And the UNITS thing is not my own idea; it's just advice from OpenCL forums. I tested it and got maximum speed...

Mathemat:

And the second thing: here and now you are working with OpenCL for MQL5 (well, that's not quite right, but you understood me), which is, after all, a different OpenCL than the one by Khronos.

And what is the "other" one?

You're confusing proprietary implementations with simple wrappers. OpenCL MQL is just a wrapper over the Khronos OpenCL API. About the difference between OpenCL MQL and Khronos.

 
Mathemat: By the way, here is the table from my second article. It would be nice if you understood all these compute units (18), stream cores (288), processing elements (1440), max wavefronts/GPU (496) and work-items/GPU (31744). I haven't figured it out yet.

compute units is the number of simultaneous tasks being executed.

max wavefronts/GPU (496) and work-items/GPU (31744) are the execution queue.

AMD CodeXL already has an answer to all these questions.

 
Roffild:

compute units is the number of simultaneous tasks being executed.

max wavefronts/GPU (496) and work-items/GPU (31744) are the queue for execution.

AMD CodeXL can help you finally - it answers all these questions.

Maybe I don't understand something, sorry. But do you know Alexey personally? From the outside it doesn't look like it... You speak too cheekily, as if cleverer than others? Being clever is not a sin; boasting about it among brothers in spirit, though, is shameful...

 

I'm a simple dude and I answer to the point.

If you really want to understand OpenCL and not just assume, you will have to install AMD CodeXL and create your own C/C++ wrapper.

I can post my wrapper, but it has some illogical lines due to my lack of practice in C/C++.

 
Roffild: So why did you base the calculation on 240 bytes? You may be able to count in bytes, but the video card cannot. So 240/8 = 30 doubles.

240 bytes is the size of the whole buffer of 30 doubles.

Look at your own code:

uint units = (uint)CLGetInfoInteger(hcontext, CL_DEVICE_MAX_COMPUTE_UNITS);
uint global_work_offset[] = {0};
uint global_work_size[1];
uint local_work_size[1];
global_work_size[0] = ArraySize(price);
Print( "Global task = ", global_work_size[0] );  /// I added this line; it printed 240. But that is easy to compute anyway: 30*sizeof(double) = 240
local_work_size[0] = global_work_size[0] / units;

Further, in the last line, you yourself divide 240 by 18 (that is the units value for your card).

And "pick a divisor without remainder" is only a recommendation from the official documentation. And this recommendation does not work perfectly.

We are working with MQL5 OpenCL. I am referring to documentation on our website. Of course, I am also looking at Khronos.

And as for UNITS, it's not my own words, but some advice from OpenCL forums. I've tested it and got the maximum speed...

Well, I got maximum speed with different parameters. So?

Let me just give you a rough picture. 18 tasks running simultaneously on the GPU is at most what 4-5 threads of a CPU could handle. And a CPU with x86 emulation can organize many more threads. At least if it's Intel. My former Pentium G840 (2 cores) gave an acceleration of about 70x, on two units! Not to mention what my current i7 can do, so to speak.

A well-parallelized task (have a look at the MetaDriver scripts from the first ocl branch) can reach a speedup of 1000 or more on the GPU (compared with 1 thread on the CPU in MQL5). If you can't find them, I can upload them for you to try on your card.

If you really want to understand OpenCL and not just guess, you will have to install AMD CodeXL and create your own C/C++ wrapper.

OK, I'll have a look at it, thanks.