OpenCL :: Example with Image Blurring Algorithm, Questions and Issues - page 3

 

2D mapping of work groups and work items. What it does:

It creates a fake work space of x by y dimensions.

It then logs all the coordinates that arrive in the kernel: the global IDs, the local IDs and the work-group IDs.

The result is written to a file after the kernel completes.

At first glance it appears that the first dimension, x, determines the number of work items in each work group, and the second dimension determines the number of work groups.

Now, there is a maximum number of work items per work group (1024) on my GPU (don't trust the MQL5 docs on this, they are wrong) and 256 available work groups for the kernel (if that part of the documentation is correct).

So I can now test whether the docs are correct by creating a (very annoying) 3D test: if it fails above 256 available groups, then the docs are right on this one.

Here is a snapshot of the output:

Item[511]:GLOBAL.ID.X(15):GLOBAL.ID.Y(31):LOCAL.ID.X(15):LOCAL.ID.Y(0):GROUP.ID.X(0):GROUP.ID.Y(31):

GLOBAL.ID.X is get_global_id(0)
GLOBAL.ID.Y is get_global_id(1) (argument 1 means the second dimension)
LOCAL.ID.X is get_local_id(0)
LOCAL.ID.Y is get_local_id(1)
GROUP.ID.X is get_group_id(0)
GROUP.ID.Y is get_group_id(1)

It appears local ID Y and group ID X are not used here.


So the output is telling us that the "slot" at position 511 was processed in group 31 by work item 15.
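The relationship between these IDs can be sketched in plain Python (a sketch of the standard OpenCL per-dimension guarantee, not MQL5 code): the local ID is the global ID modulo the work-group size, and the group ID is the global ID divided by it.

```python
# Sketch: OpenCL's ID arithmetic per dimension. For every dimension d:
#   get_local_id(d) == get_global_id(d) %  local_size[d]
#   get_group_id(d) == get_global_id(d) // local_size[d]

def ids(global_id, local_size):
    local = tuple(g % l for g, l in zip(global_id, local_size))
    group = tuple(g // l for g, l in zip(global_id, local_size))
    return local, group

# The logged item above has global IDs (15, 31); the output is consistent
# with a runtime-chosen local size of (16, 1).
local, group = ids((15, 31), (16, 1))
print(local, group)  # (15, 0) (0, 31)
```

This matches the log line: LOCAL.ID (15, 0) and GROUP.ID (0, 31), assuming the runtime picked a (16, 1) work-group shape.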

Edit: if I supply the additional local_size_array argument to CLExecute and set it to {4,4}, this happens:

Item[511]:GLOBAL.ID.X(15):GLOBAL.ID.Y(31):LOCAL.ID.X(3):LOCAL.ID.Y(3):GROUP.ID.X(3):GROUP.ID.Y(7):

Now we can investigate what on earth that does.

neat
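The {4,4} result fits the same modular arithmetic; a quick Python check (a sketch, not MQL5 code):

```python
# Check the logged {4,4} case: global IDs (15, 31), local size (4, 4).
gx, gy = 15, 31
lx, ly = 4, 4
print(gx % lx, gy % ly)    # local IDs:  3 3  -> LOCAL.ID.X(3), LOCAL.ID.Y(3)
print(gx // lx, gy // ly)  # group IDs: 3 7  -> GROUP.ID.X(3), GROUP.ID.Y(7)
```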

Edit 2: CL_KERNEL_WORK_GROUP_SIZE matches the number of items OpenCL divides the full workload by. For instance, I get that value as 256 (for the 1D kernel).
The docs say (let's skip the "what" and focus on the phrasing) "that are available for the kernel".
So, to simplify the observation with the 1D test: 512 items split into 2 groups, 1024 items into 4 groups, 2560 items into 10 groups, and so on.
Now, if I keep increasing the item count by that value and run the kernel each time, I will be able to get the concurrency number, meaning how many groups run in parallel at the same time, and that would be a very useful indication. How would we get the number of concurrent groups? By measuring the execution time: concurrent groups won't deviate much in execution time. Essentially, it will tell us how to split the workload.
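The timing idea can be sketched as follows (all numbers are hypothetical, and `estimate_concurrency` is my own illustration, not a measured result):

```python
# Sketch: if the device runs up to C work groups concurrently, execution
# time should stay roughly flat as the group count grows, then jump once
# the count exceeds that concurrent capacity.

def estimate_concurrency(timings, jump=1.5):
    """timings: list of (n_groups, ms) pairs, sorted by n_groups.
    Returns the last group count before the first significant time jump."""
    prev_n, prev_ms = timings[0]
    for n, ms in timings[1:]:
        if ms > prev_ms * jump:
            return prev_n
        prev_n, prev_ms = n, ms
    return prev_n  # no jump seen: capacity is at least the max tested

# Hypothetical measurements: flat up to 16 groups, doubling at 32.
fake = [(2, 10.1), (4, 10.3), (8, 10.2), (16, 10.4), (32, 20.5)]
print(estimate_concurrency(fake))  # 16
```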


https://web.engr.oregonstate.edu/~mjb/cs575/Handouts/gpu101.1pp.pdf

SM/Multiprocessor = Compute Unit

Thread = Work Item 

Cuda Core = Processing Element

Warp = ? (the closest OpenCL concept is probably the sub-group; AMD's term is wavefront)

How to get info on your nVidia gear like this:

(B is the compute units; A matches the result of the query CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS per dimension with https://www.mql5.com/en/docs/opencl/clgetdeviceinfo)

  • First, go to the driver downloads page, enter your details (for your GPU) and download the latest driver. Note the driver version!
  • Then go to the CUDA updates listing page, which shows which CUDA version is supported for the driver version you have.
  • Now go to the CUDA releases page and download the appropriate CUDA version! Do not just download the latest CUDA.
  • Once it is installed, open up cmd.exe.
  • Go to installation_folder\extras\demo_suite\ and run deviceQuery.exe.

 

Hi,

consider these 2 kernels:

Kernel A 

    string kernelA="__kernel void memtestsA(__global int* group_id,"
                                         "int iterations){"
                                         "double sum=(double)0.0;"
                                         "double inc=(double)-2.6;"
                                         "double step=(double)0.01;"
                                         "double radi=(double)0.0;"
                                         "double rstep=(double)0.001;"
                                         "for(int i=0;i<iterations;i++){"
                                         "sum+=(double)i/(double)10.0+(double)rstep;"
                                         "sum*=(((double)i)-((double)i))*(double)step;"
                                         "if(sum>0.32){sum=0.0;}"
                                         "}"
                                         "group_id[get_global_id(0)]=get_group_id(0);}";

Kernel B 

    string kernelB="__kernel void memtestsB(__global int* group_id,"
                                         "int iterations){"
                                         "double sum=(double)0.0;"
                                         "double inc=(double)-2.6;"
                                         "double step=(double)0.01;"
                                         "double radi=(double)0.0;"
                                         "double rstep=(double)0.001;"
                                         "for(int i=0;i<iterations;i++){"
                                         "sum=((double)tanh(inc))*((double)sin(radi));"
                                         "inc+=step;"
                                         "radi+=rstep;"
                                         "if(inc>2.6&&step>0.0){step=(double)-0.01;}"
                                         "if(inc<-2.6&&step<0.0){step=(double)0.01;}"
                                         "if(radi>6.28&&rstep>0.0){rstep=(double)-0.001;}"
                                         "if(radi<0.0&&rstep<0.0){rstep=(double)0.001;}"                                         
                                         "}"
                                         "group_id[get_global_id(0)]=get_group_id(0);}";  

Why is Kernel A not registering its private mem size, while Kernel B reports 40 bytes? If you know, thanks.

2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   Kernel A : ----
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   --------Work Group Size : 256
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   --------Private Mem : 0
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   --------Local Mem : 1
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   Kernel B : ----
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   --------Work Group Size : 256
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   --------Private Mem : 40
2023.05.03 16:18:12.740 kernel_private_mem_testor (USDJPY,H1)   --------Local Mem : 1

They are both just for testing; there is no goal in the calculations. (I expect there may be errors.)

I'm attaching the source code of this small test; it does not execute anything on the device.

Also, when I change Kernel B from double types to float, it reports 32 bytes.

Alternate Kernel B:

    string kernelB="__kernel void memtestsB(__global int* group_id,"
                                         "int iterations){"
                                         "float sum=(float)0.0;"
                                         "float inc=(float)-2.6;"
                                         "float step=(float)0.01;"
                                         "float radi=(float)0.0;"
                                         "float rstep=(float)0.001;"
                                         "for(int i=0;i<iterations;i++){"
                                         "sum=((float)tanh(inc))*((float)sin(radi));"
                                         "inc+=step;"
                                         "radi+=rstep;"
                                         "if(inc>2.6&&step>0.0){step=(float)-0.01;}"
                                         "if(inc<-2.6&&step<0.0){step=(float)0.01;}"
                                         "if(radi>6.28&&rstep>0.0){rstep=(float)-0.001;}"
                                         "if(radi<0.0&&rstep<0.0){rstep=(float)0.001;}"                                         
                                         "}"
                                         "group_id[get_global_id(0)]=get_group_id(0);}";  

Report with that:

2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   Kernel A : ----
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   --------Work Group Size : 256
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   --------Private Mem : 0
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   --------Local Mem : 1
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   Kernel B : ----
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   --------Work Group Size : 256
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   --------Private Mem : 32
2023.05.03 16:23:45.794 kernel_private_mem_testor (USDJPY,H1)   --------Local Mem : 1
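A possible explanation for Kernel A's zero reading (my assumption, not verified against the compiler): the line `sum*=(((double)i)-((double)i))*(double)step;` multiplies by zero on every iteration, so `sum` is provably 0.0 regardless of `iterations`; the loop is dead code and the optimizer can eliminate every private variable. Kernel B's tanh/sin chain cannot be folded away, and its five live doubles match the reported 40 bytes (5 × 8); the float variant's 32 bytes doesn't match 5 × 4 as neatly, which may come down to alignment or compiler temporaries. Mirroring Kernel A's loop in Python shows the constant result:

```python
# Mirror Kernel A's loop body in plain Python (a sketch of the hypothesis,
# not the OpenCL compiler's actual behavior).
def kernel_a_loop(iterations, step=0.01, rstep=0.001):
    s = 0.0
    for i in range(iterations):
        s += i / 10.0 + rstep
        s *= (float(i) - float(i)) * step  # always multiplied by 0.0
        if s > 0.32:
            s = 0.0
    return s

# The result is 0.0 for any iteration count, so an optimizing compiler
# may fold the loop away entirely, leaving no private memory to report.
print(kernel_a_loop(1000))  # 0.0
```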
 

OpenCL local memory, work groups, atomic and barrier

Code and explanation.

Note: I'm allocating 8 bytes of local memory, but the kernel reports 12 bytes; I still don't know why that is.
In this blog I'm exploring how the local memory operates with regard to a work group (of work items). We create a simple kernel that will export the IDs (global ID, local ID, group ID) of a work item.