Look at your own code. In its last line you yourself divide 240 by 18 (18 being the number of units on your card).
You're obviously confused about something. Here is the disputed piece:
Output: global=30 local=1
The 240 bytes are just the size used when creating the buffer.
global_work_size[0]
And local_work_size[0] = (uint) 240/18 = 13
P.S. Yes, you are right. My apologies, I got a bit confused.
local_work_size[0] = (uint) 30/18 = 1. I get the same result, since units=28 on my card.
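The arithmetic in this exchange is easy to check. Below is a minimal sketch in Python (not MQL5), assuming local_work_size[0] is obtained by integer-dividing global_work_size[0] by the number of units, as the posts describe. The helper name is made up for illustration, and the 8 bytes per element is an assumption inferred from 240 bytes / 30 work items.

```python
def local_work_size(global_work_size: int, units: int) -> int:
    """Hypothetical helper: integer division, mirroring the (uint) cast in the posts."""
    return global_work_size // units

# Mistaken reading: the 240-byte buffer size is taken as the global size.
mistaken = local_work_size(240, 18)

# Values actually reported in the thread: global=30, units=18 or units=28.
card_18 = local_work_size(30, 18)
card_28 = local_work_size(30, 28)

# Assumption: 8 bytes per element, so 30 elements occupy 240 bytes.
BYTES_PER_ELEMENT = 8
buffer_bytes = 30 * BYTES_PER_ELEMENT

print(mistaken, card_18, card_28, buffer_bytes)
```

Both cards end up with a local size of 1 because 30 is smaller than twice the unit count in either case.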
Again, Roffild:
Mathemat: Let's make a rough estimate. 18 tasks running simultaneously on the GPU's tiny cores is at most what 4-5 CPU threads can handle. And a CPU under x86 emulation can run many more threads than that, at least if it is an Intel. My former Pentium G840 (2 cores) gave a speedup of roughly 70 times, on two units! Not to mention what my current CPU (roughly speaking, an i7) can do.
A well-parallelized task (see MetaDriver's scripts from the first ocl thread) can achieve speedups of 1000 times or more on a GPU, compared with single-threaded execution on a CPU in MQL5. If you can't find the scripts, I can send them to you so you can test them on your card.
Have you figured out the buffer and its speed?
You'd better use AMD CodeXL to figure out UNITS etc. - it has nice performance graphs.
AMD CodeXL itself is glitchy but it is difficult to draw any conclusions without it.
I'm not going to use OpenCL until the tester allows me to use the CPU, or until I have a task that runs longer than number_of_buffers * 0.353 msec.
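That threshold can be written down as a simple break-even check. This is only a sketch in Python, taking the 0.353 msec per buffer from the post as given and assuming the overhead scales linearly with the number of buffers. The function names are hypothetical.

```python
PER_BUFFER_MS = 0.353  # per-buffer cost quoted in the post

def buffer_overhead_ms(number_of_buffers: int) -> float:
    """Total buffer overhead, assuming it grows linearly with the buffer count."""
    return number_of_buffers * PER_BUFFER_MS

def worth_offloading(task_ms: float, number_of_buffers: int) -> bool:
    """True only if the task runs longer than the buffer overhead alone."""
    return task_ms > buffer_overhead_ms(number_of_buffers)

# Example with 3 buffers: the overhead is a little over 1 msec.
print(buffer_overhead_ms(3))
print(worth_offloading(2.0, 3), worth_offloading(1.0, 3))
```

Passing this check is a necessary condition rather than a guarantee: the kernel still has to beat the CPU version on the remaining time.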
P.S. I did finish optimizing my code, and the final variant passes the test in 33 seconds (320 seconds before optimization, 55 seconds "OpenCL-style").
There's nothing to figure out: it is clearly a slow operation. The conclusion is to increase the amount of work done inside the kernel (your code has too little of it).
And buy a more modern video card; things seem to have improved with those.
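The advice to put more work into the kernel follows from a simple amortization argument. Here is a rough sketch in Python, assuming a fixed overhead per call that does not depend on how much the kernel computes. It uses the 0.353 msec figure quoted earlier in the thread for a single buffer; the kernel durations are invented for illustration.

```python
def overhead_share(overhead_ms: float, kernel_ms: float) -> float:
    """Fraction of the total run time spent on the fixed overhead."""
    return overhead_ms / (overhead_ms + kernel_ms)

OVERHEAD_MS = 0.353  # figure quoted in the thread; one buffer assumed

# Hypothetical kernel durations, from very light to heavy.
for kernel_ms in (0.1, 0.353, 3.53, 35.3):
    share = overhead_share(OVERHEAD_MS, kernel_ms)
    print(f"kernel {kernel_ms:>6} ms -> overhead is {share:.0%} of the total")
```

Under this model the overhead dominates while the kernel is short and becomes negligible once the kernel runs for tens of milliseconds.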
AMD CodeXL itself is glitchy but it's hard to draw any conclusions without it.
Intel's utility is quite useful too, but only for Intel processors, and for catching the most obvious errors in a kernel.
P.S. I have, after all, finished optimizing my code and the final variant passes the test in 33 seconds (320 seconds - before optimization, 55 seconds - "OpenCL-style").
It's already much better.
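To put the timings quoted above in perspective, the ratios can be computed directly. A small sketch in Python using only the three figures from the post; the variable and function names are mine.

```python
BEFORE_S = 320  # before optimization
OPENCL_S = 55   # "OpenCL-style" variant
FINAL_S = 33    # final optimized variant

def speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster the new run is than the baseline."""
    return baseline_s / new_s

final_vs_before = speedup(BEFORE_S, FINAL_S)
opencl_vs_before = speedup(BEFORE_S, OPENCL_S)
final_vs_opencl = speedup(OPENCL_S, FINAL_S)

print(f"{final_vs_before:.2f} {opencl_vs_before:.2f} {final_vs_opencl:.2f}")
```

So the final variant is a bit under 10 times faster than the original, and roughly 1.7 times faster than the "OpenCL-style" one.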
Today I needed to generate an array of numbers with a single bit set.
At the same time I practiced with OpenCL.
I'm posting the code as a demonstration of an interesting method to calculate global_work_size and local_work_size. The idea itself is taken from IntrotoOpenCL.pdf (I have a copy), but I tweaked it.