So you finally found a way to post a gif without moderator intervention.
I found some issues myself:
- the output buffer size was x*y*8 bytes when it should be x*y*4 bytes
- but the real speed gain, the first one so far, came from following @William Roeder's suggestion on another thread about the ambiguity in how the power function is calculated
Now OpenCL+GPU is 5 times faster than the default, so thank you William. I don't understand why, but I assume it had to move memory around to do the conversions and that slowed it down.
OpenCL+CPU is 2.5 times slower though.
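To make the buffer size fix concrete, here is a minimal C++ sketch. It assumes each output pixel is one packed 32-bit colour value, which is what the x*y*4 figure implies; the function name is made up for illustration and is not taken from the attached code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed for a width*height image with one 32-bit value per pixel.
// Assumption: pixels are packed 32-bit colours, so 4 bytes each (not 8).
std::size_t outputBufferBytes(std::size_t width, std::size_t height) {
    assert(width > 0 && height > 0);
    return width * height * sizeof(std::uint32_t);
}
```

Allocating twice that size does not break the result, but it doubles the memory that has to be transferred between host and device.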
I'm attaching the updated code.
The code is far from perfect, and updates will be posted for each "gain" in speed for this particular code. (It may not speed up a neural net kernel, for instance, but it will show you general approaches.)
Now, you will be looking up the device info documentation constantly, so here are the Khronos docs that MQL5 themselves suggest you refer to:
CLGetDeviceInfo : https://registry.khronos.org/OpenCL/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html
And all the docs : https://registry.khronos.org/OpenCL/
I'd like to know which OpenCL API commands this parameter uses in this function.
Function: https://www.mql5.com/en/docs/opencl/clexecute
Parameter: const uint& local_work_size[] // Number of tasks in the local group
For instance, if a user has a GPU with 32 warps (NVIDIA) / wavefronts (ATI/AMD), how could this array be adjusted? If you have that information, of course.
(For instance, if someone wanted to change the equivalent of this parameter in OpenCL in C++, what would they alter or add?)
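As a rough orientation for the C++ side of the question: in the plain OpenCL C API, the global and local work sizes are arguments of clEnqueueNDRangeKernel, and the CLExecute parameters have the same names. Whether MQL5 forwards them to that call unchanged is an assumption here, not something the MQL5 docs state. The sketch below only does the host-side arithmetic: it picks a local size whose first dimension matches a warp/wavefront width and pads the global size up to a multiple of it, which OpenCL 1.x requires when a local size is given. All names in it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Round value up to the next multiple of `multiple`.
std::size_t roundUpToMultiple(std::size_t value, std::size_t multiple) {
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

// Hypothetical holder for the two arrays a 2D launch needs.
struct WorkSizes {
    std::size_t global[2];  // total work-items per dimension (padded)
    std::size_t local[2];   // work-items per work-group per dimension
};

// warpWidth: 32 is typical for NVIDIA warps; AMD wavefronts are often 64.
// rowsPerGroup: image rows handled by one work-group (a tuning knob).
WorkSizes chooseWorkSizes(std::size_t width, std::size_t height,
                          std::size_t warpWidth, std::size_t rowsPerGroup) {
    WorkSizes ws;
    ws.local[0] = warpWidth;
    ws.local[1] = rowsPerGroup;
    ws.global[0] = roundUpToMultiple(width, ws.local[0]);
    ws.global[1] = roundUpToMultiple(height, ws.local[1]);
    return ws;
}

// With the OpenCL C API the arrays would then be passed like this:
//   clEnqueueNDRangeKernel(queue, kernel, 2, nullptr,
//                          ws.global, ws.local, 0, nullptr, nullptr);
```

Because the global size is padded, the kernel has to skip work-items whose x or y falls outside the image. The product local[0]*local[1] also has to stay within CL_DEVICE_MAX_WORK_GROUP_SIZE, which can be queried through the device info call.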
Local memory is an advanced optimization technique in OpenCL (really, a hard subject). Another optimization technique is kernel vectorization.
With local memory, you have to use sync objects (barriers, or memory fences) to synchronize the threads executing in the same work-group, and you should test various optimizations, which are not portable between different graphics card manufacturers.
See this article https://www.mql5.com/en/articles/407
Section: 2.7. Transferring the Column of the Second Array to Local Memory
I haven't picked the most suitable example for this, as each pixel can be independent, it seems.
I'm just trying to relate the argument of the function to external OpenCL tutorials to grasp the operation better.
Also, I think there was no ArgMemLocal at the time this article was written. Shouldn't the author have used it?
PS: note that I am an utter noob in such matters.
There is now:
bool CLSetKernelArgMemLocal(
int kernel, // handle to a kernel of an OpenCL program
uint arg_index, // number of the OpenCL function argument
ulong local_mem_size // buffer size
);
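For comparison with C++ tutorials: in the plain OpenCL C API, a __local kernel argument is sized by calling clSetKernelArg with a byte count and a null pointer, so the local_mem_size above plays the role of that byte count. The sketch below only computes such a size. The tile layout (one 32-bit pixel per work-item plus a halo on each side) is a hypothetical example and is not taken from the article or the attached code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes of __local scratch so that each work-item in a group can stage one
// 32-bit pixel, plus `halo` extra pixels on each side of the group.
// Hypothetical layout, for illustration only.
std::size_t localTileBytes(std::size_t workGroupSize, std::size_t halo) {
    assert(workGroupSize > 0);
    return (workGroupSize + 2 * halo) * sizeof(std::uint32_t);
}

// In C++ with the OpenCL C API this size would be passed as:
//   clSetKernelArg(kernel, argIndex, localTileBytes(64, 3), nullptr);
```

The result has to fit within the device's CL_DEVICE_LOCAL_MEM_SIZE, which is another value reported by the device info query.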
I did not play with local memory at the time I was testing OpenCL. I found it very complicated, even with other resources on the internet. I only tested vectorized kernels.
I managed to implement a bitonic sort that beats radix sort in speed (by optimizing the kernel algorithm).
But, in the end, OpenCL is a non-portable solution for parallel processing. Now, it is almost dead!
Hello
Sharing a benchmark for OpenCL that processes an image and makes it look as if it is constantly coming in and out of focus.
The normal execution is 3 times faster than OpenCL, which means the OpenCL version can be improved. (3x faster with min 0 and max 3; the parameters are explained further down.)
I don't need to use this code; I just wanted to immerse myself in OpenCL a bit, and I believe such a simple algorithm could be helpful for starting out with OpenCL.
I'm not judging the native OpenCL library; I know there are probably issues in my approach to the solution that end up making the OpenCL blur 3 times slower.
Now, if you see an omission, blatant or not, or a fundamental error in my approach, let the forum know, of course.
Note: this does not use any built-in GPU functionality for the image. The "standard" everyday poor man's solution for blurring is also available in its MQL5 form so you can compare.
With that said, after you finish laughing at my OpenCL C source code, please also share your insights publicly; they will be helpful to me and readers alike. 😊
Anyway. Here is what the algorithm does:
You have an input at the top where you can select which mode of execution to run.
Below that you have a parameter for the bmp file to load and play with. It's an image of a pizza 🤏, I'll attach that too.
Then you have blurPulseMin and blurPulseMax. What are those?
Suppose you set the min to 0 and the max to 10.
That creates a "blur pulse" that constantly oscillates between 0 and 10. At each point, if bp is the blur pulse, it takes a region of ((bp)*2)^2 around each pixel, derives the neighboring pixels' color data from it to mix into the final blurred pixel, composes the blurred photo, and updates the display.
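The description above can be sketched on the CPU in C++ roughly as follows. This is not the attached source, just a minimal illustration under stated assumptions: pixels are packed 32-bit values with four 8-bit channels, the pulse moves one step at a time, and the window spans x-bp..x+bp and y-bp..y+bp (so it includes the center pixel and is clipped at the image edges, which differs slightly from the ((bp)*2)^2 figure). Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Triangle wave: minV, minV+1, ..., maxV, maxV-1, ..., minV, and so on.
int blurPulseAt(int step, int minV, int maxV) {
    assert(step >= 0 && minV <= maxV);
    const int span = maxV - minV;
    if (span == 0) return minV;
    const int phase = step % (2 * span);
    return minV + (phase <= span ? phase : 2 * span - phase);
}

// Box blur: each output channel is the integer average of that channel
// over the window around the pixel, clipped to the image bounds.
std::vector<std::uint32_t> boxBlur(const std::vector<std::uint32_t>& src,
                                   int width, int height, int bp) {
    assert(width > 0 && height > 0 && bp >= 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    std::vector<std::uint32_t> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int x0 = std::max(0, x - bp), x1 = std::min(width - 1, x + bp);
            const int y0 = std::max(0, y - bp), y1 = std::min(height - 1, y + bp);
            std::uint64_t sum[4] = {0, 0, 0, 0};
            std::uint64_t count = 0;
            for (int ny = y0; ny <= y1; ++ny) {
                for (int nx = x0; nx <= x1; ++nx) {
                    const std::uint32_t p =
                        src[static_cast<std::size_t>(ny) * width + nx];
                    for (int c = 0; c < 4; ++c) sum[c] += (p >> (8 * c)) & 0xFFu;
                    ++count;
                }
            }
            std::uint32_t out = 0;
            for (int c = 0; c < 4; ++c)
                out |= static_cast<std::uint32_t>(sum[c] / count) << (8 * c);
            dst[static_cast<std::size_t>(y) * width + x] = out;
        }
    }
    return dst;
}
```

One frame of the animation would then be boxBlur(image, w, h, blurPulseAt(step, blurPulseMin, blurPulseMax)). Each output pixel depends only on the source image, which is why the work maps naturally to one OpenCL work-item per pixel.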
This is my 2nd day with OpenCL and my 1st with C, but I think I've more or less got the structure of things. So if you have any questions about the code in the source file, ask; if I can't answer, there are many others who can.
Here is a visual of the test. The interval measured in milliseconds is for a full cycle of the "blurPulse" going from min to max and back to min.
That was the benchmark attempt.
The gif below runs in standard mode and takes 28 seconds per round trip with min 0 and max 10.
Cheers, I apologize if I made you hungry. 😎 🍺 🍕