OpenCL :: Example with Image Blurring Algorithm, Questions and Issues - page 3

 

2D mapping of work groups and work items. What it does:

It creates a fake work space of x by y dimensions.

It then logs all the coordinates that arrive in the kernel: the global IDs, the local IDs and the work-group IDs.

The result is written to a file after the kernel completes.

At first glance it appears that the first dimension, x, determines the number of work items in each work group, and the second dimension determines the number of work groups.

Now, there is a maximum number of work items per work group (1024) on my GPU (don't trust the MQL5 docs on this, they are wrong) and 256 available work groups for the kernel (if that part of the documentation is correct).

So I can now test whether the docs are correct by creating a (very annoying) 3D test: if it fails above 256 available groups, then the docs are right on this one.

Here is a snapshot of the output:

Item[511]:GLOBAL.ID.X(15):GLOBAL.ID.Y(31):LOCAL.ID.X(15):LOCAL.ID.Y(0):GROUP.ID.X(0):GROUP.ID.Y(31):

GLOBAL.ID.X is get_global_id(0)
GLOBAL.ID.Y is get_global_id(1) (argument 1 means the second dimension)
LOCAL.ID.X is get_local_id(0)
LOCAL.ID.Y is get_local_id(1)
GROUP.ID.X is get_group_id(0)
GROUP.ID.Y is get_group_id(1)

It appears local ID Y and group ID X are not used here.


So the output is telling us that the "slot" at position 511 was processed in group 31 by work item 15.
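The relationship between these IDs can be sketched in plain Python (a sketch of the standard OpenCL per-dimension guarantee, not MQL5 code): the local ID is the global ID modulo the work-group size, and the group ID is the global ID divided by it.

```python
# Sketch: OpenCL's ID arithmetic per dimension. For every dimension d:
#   get_local_id(d) == get_global_id(d) %  local_size[d]
#   get_group_id(d) == get_global_id(d) // local_size[d]

def ids(global_id, local_size):
    local = tuple(g % l for g, l in zip(global_id, local_size))
    group = tuple(g // l for g, l in zip(global_id, local_size))
    return local, group

# The logged item above has global IDs (15, 31); the output is consistent
# with a runtime-chosen local size of (16, 1).
local, group = ids((15, 31), (16, 1))
print(local, group)  # (15, 0) (0, 31)
```

This matches the log line: LOCAL.ID (15, 0) and GROUP.ID (0, 31), assuming the runtime picked a (16, 1) work-group shape.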

Edit: if I supply the additional local_size_array argument to CLExecute and set it to {4,4}, this happens:

Item[511]:GLOBAL.ID.X(15):GLOBAL.ID.Y(31):LOCAL.ID.X(3):LOCAL.ID.Y(3):GROUP.ID.X(3):GROUP.ID.Y(7):

Now we can investigate what on earth that does.

neat
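The {4,4} result fits the same modular arithmetic; a quick Python check (a sketch, not MQL5 code):

```python
# Check the logged {4,4} case: global IDs (15, 31), local size (4, 4).
gx, gy = 15, 31
lx, ly = 4, 4
print(gx % lx, gy % ly)    # local IDs:  3 3  -> LOCAL.ID.X(3), LOCAL.ID.Y(3)
print(gx // lx, gy // ly)  # group IDs: 3 7  -> GROUP.ID.X(3), GROUP.ID.Y(7)
```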

Edit 2: CL_KERNEL_WORK_GROUP_SIZE matches the number of items OpenCL divides the full workload by. For instance, I get that value as 256 (for the 1D kernel).
The docs say (let's skip the "what" and focus on the phrasing) "that are available for the kernel".
So, to simplify the observation with the 1D test: 512 items split into 2 groups, 1024 items into 4 groups, 2560 items into 10 groups, and so on.
Now, if I keep increasing the item count by that value and run the kernel each time, I will be able to get the concurrency number, meaning how many groups run in parallel at the same time, and that would be a very useful indication. How would we get the number of concurrent groups? By measuring the execution time: concurrent groups won't deviate much in execution time. Essentially, it will tell us how to split the workload.
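The timing idea can be sketched as follows (all numbers are hypothetical, and `estimate_concurrency` is my own illustration, not a measured result):

```python
# Sketch: if the device runs up to C work groups concurrently, execution
# time should stay roughly flat as the group count grows, then jump once
# the count exceeds that concurrent capacity.

def estimate_concurrency(timings, jump=1.5):
    """timings: list of (n_groups, ms) pairs, sorted by n_groups.
    Returns the last group count before the first significant time jump."""
    prev_n, prev_ms = timings[0]
    for n, ms in timings[1:]:
        if ms > prev_ms * jump:
            return prev_n
        prev_n, prev_ms = n, ms
    return prev_n  # no jump seen: capacity is at least the max tested

# Hypothetical measurements: flat up to 16 groups, doubling at 32.
fake = [(2, 10.1), (4, 10.3), (8, 10.2), (16, 10.4), (32, 20.5)]
print(estimate_concurrency(fake))  # 16
```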


https://web.engr.oregonstate.edu/~mjb/cs575/Handouts/gpu101.1pp.pdf

SM/Multiprocessor = Compute Unit

Thread = Work Item 

Cuda Core = Processing Element

Warp = ? (the closest OpenCL concept is probably the sub-group; AMD's term is wavefront)

How to get info on your nVidia gear like this:

(B is the compute units; A matches the result of the query CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS per dimension with https://www.mql5.com/en/docs/opencl/clgetdeviceinfo)

  • First, go to the driver downloads page, enter your details (for your GPU) and download the latest driver. Note the driver version!
  • Then go to the CUDA updates listing page, which shows which CUDA version is supported for the driver version you have.
  • Now go to the CUDA releases page and download the appropriate CUDA version! Do not just download the latest CUDA.
  • Once it is installed, open up cmd.exe.
  • Go to installation_folder\extras\demo_suite\ and run deviceQuery.exe.

 

Hi,

consider these 2 kernels:

Kernel A 

    string kernelA="__kernel void memtestsA(__global int* group_id,"
                                         "int iterations){"
                                         "double sum=(double)0.0;"
                                         "double inc=(double)-2.6;"
                                         "double step=(double)0.01;"
                                         "double radi=(double)0.0;"
                                         "double rstep=(double)0.001;"
                                         "for(int i=0;i<iterations;i++){"
                                         "sum+=(double)i/(double)10.0+(double)rstep;"
                                         "sum*=(((double)i)-((double)i))*(double)step;"
                                         "if(sum>0.32){sum=0.0;}"
                                         "}"
                                         "group_id[get_global_id(0)]=get_group_id(0);}";

Kernel B 

    string kernelB="__kernel void memtestsB(__global int* group_id,"
                                         "int iterations){"
                                         "double sum=(double)0.0;"
                                         "double inc=(double)-2.6;"
                                         "double step=(double)0.01;"
                                         "double radi=(double)0.0;"
                                         "double rstep=(double)0.001;"
                                         "for(int i=0;i<iterations;i++){"
                                         "sum=((double)tanh(inc))*((double)sin(radi));"
                                         "inc+=step;"
                                         "radi+=rstep;"
                                         "if(inc>2.6&&step>0.0){step=(double)-0.01;}"
                                         "if(inc<-2.6&&step<0.0){step=(double)0.01;}"
                                         "if(radi>6.28&&rstep>0.0){rstep=(double)-0.001;}"
                                         "if(radi<0.0&&rstep<0.0){rstep=(double)0.001;}"                                         
                                         "}"
                                         "group_id[get_global_id(0)]=get_group_id(0);}";  

Why is Kernel A not registering its private mem size, while Kernel B reports 40 bytes? If you know, thanks.

2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   Kernel A : ----
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   --------Work Group Size : 256
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   --------Private Mem : 0
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   --------Local Mem : 1
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   Kernel B : ----
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   --------Work Group Size : 256
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   --------Private Mem : 40
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   --------Local Mem : 1

They are both just for testing; there is no goal in the calculations. (I expect there may be errors.)

I'm attaching the source code of this small test; it does not execute anything on the device.

Also, when I change Kernel B from double types to float, it reports 32 bytes.

Alternate Kernel B:

    string kernelB="__kernel void memtestsB(__global int* group_id,"
                                         "int iterations){"
                                         "float sum=(float)0.0;"
                                         "float inc=(float)-2.6;"
                                         "float step=(float)0.01;"
                                         "float radi=(float)0.0;"
                                         "float rstep=(float)0.001;"
                                         "for(int i=0;i<iterations;i++){"
                                         "sum=((float)tanh(inc))*((float)sin(radi));"
                                         "inc+=step;"
                                         "radi+=rstep;"
                                         "if(inc>2.6&&step>0.0){step=(float)-0.01;}"
                                         "if(inc<-2.6&&step<0.0){step=(float)0.01;}"
                                         "if(radi>6.28&&rstep>0.0){rstep=(float)-0.001;}"
                                         "if(radi<0.0&&rstep<0.0){rstep=(float)0.001;}"                                         
                                         "}"
                                         "group_id[get_global_id(0)]=get_group_id(0);}";  

Report with that:

2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   Kernel A : ----
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   --------Work Group Size : 256
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   --------Private Mem : 0
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   --------Local Mem : 1
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   Kernel B : ----
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   --------Work Group Size : 256
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   --------Private Mem : 32
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   --------Local Mem : 1
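A possible explanation for Kernel A's zero reading (my assumption, not verified against the compiler): the line `sum*=(((double)i)-((double)i))*(double)step;` multiplies by zero on every iteration, so `sum` is provably 0.0 regardless of `iterations`; the loop is dead code and the optimizer can eliminate every private variable. Kernel B's tanh/sin chain cannot be folded away, and its five live doubles match the reported 40 bytes (5 × 8); the float variant's 32 bytes doesn't match 5 × 4 as neatly, which may come down to alignment or compiler temporaries. Mirroring Kernel A's loop in Python shows the constant result:

```python
# Mirror Kernel A's loop body in plain Python (a sketch of the hypothesis,
# not the OpenCL compiler's actual behavior).
def kernel_a_loop(iterations, step=0.01, rstep=0.001):
    s = 0.0
    for i in range(iterations):
        s += i / 10.0 + rstep
        s *= (float(i) - float(i)) * step  # always multiplied by 0.0
        if s > 0.32:
            s = 0.0
    return s

# The result is 0.0 for any iteration count, so an optimizing compiler
# may fold the loop away entirely, leaving no private memory to report.
print(kernel_a_loop(1000))  # 0.0
```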
 

OpenCL local memory, work groups, atomic and barrier

Code and explanation.

Note: I'm allocating 8 bytes of local memory, but the kernel reports 12 bytes; I still don't know why that is.
In this blog I'm exploring how the local memory operates with regard to a work group (of work items). We create a simple kernel that will export the IDs (global ID, local ID, group ID) of a work item.