OpenCL :: Exploring the 1st dimension (part 3 : correct benchmark)

3 May 2023, 03:13
Lorentzos Roussos

Read Part 1 

Read Part 2 

Running the test a couple of times, we get the same execution time no matter how many iterations we ask for. That means something is wrong, or we are getting cached responses.

And that is the timer function. Let's see if it works. It does:

2023.05.03 02:45:59.711 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(1000)
2023.05.03 02:45:59.711 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 02:45:59.711 blog_benchmark_2 (USDJPY,H1)    Milliseconds (15)
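One thing worth checking at this point is the resolution of the timer itself. GetTickCount, which this benchmark uses for timing, only advances once per system timer tick, and on Windows that tick is commonly around 15.6 ms, so a reading of 15 ms may simply be one tick. The sketch below is a plain C++ model of that effect, not MQL5 code; the 15.625 ms tick and the function names are assumptions for illustration only.

```cpp
#include <cassert>
#include <cmath>

// Model of a millisecond counter that only advances once per system timer tick.
// ASSUMPTION: tick_ms = 15.625 is a common Windows default, not a value
// measured on the author's machine.
long tick_reading(double t_ms, double tick_ms = 15.625) {
    assert(t_ms >= 0.0 && tick_ms > 0.0);
    double last_tick = std::floor(t_ms / tick_ms) * tick_ms; // time of the latest tick
    return static_cast<long>(std::floor(last_tick));          // counter reports whole ms
}

// What a start/stop measurement would report for a run from start_ms to end_ms.
long measured_ms(double start_ms, double end_ms, double tick_ms = 15.625) {
    assert(end_ms >= start_ms);
    return tick_reading(end_ms, tick_ms) - tick_reading(start_ms, tick_ms);
}
```

In this model a run that really takes 1 ms (from 15 to 16) reads as 15 ms, and a 5 ms run starting at 0 reads as 0 ms. If finer readings are needed, MQL5's GetMicrosecondCount() may be a better fit.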

Okay, now what are we looking for?

Whatever is done in parallel will finish in about the same time, right? So we are looking for the extra delay that is not tiny compared to the overall execution time.

So let's start with 2048 items, and increase the iterations so that the times become noticeable. Let's try 1 million again.

Actually, let's send 1 item and 1 million iterations.

2023.05.03 02:49:57.474 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(1000000)
2023.05.03 02:49:57.474 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 02:49:57.474 blog_benchmark_2 (USDJPY,H1)    Milliseconds (16)

Awesome. Now 1 item and 100 million iterations:

2023.05.03 02:51:17.223 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(100000000)
2023.05.03 02:51:17.223 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 02:51:17.223 blog_benchmark_2 (USDJPY,H1)    Milliseconds (16)

Okay, I changed the kernel again, to this:

    string kernel="__kernel void memtests(__global int* group_id,"
                                         "int iterations){"
                                         "double sum=(double)0.0;"        // last computed product
                                         "double inc=(double)-2.6;"       // tanh argument, swept between -2.6 and 2.6
                                         "double step=(double)0.01;"      // increment for inc
                                         "double radi=(double)0.0;"       // sin argument, swept between 0 and 6.28
                                         "double rstep=(double)0.00001;"  // increment for radi
                                         "for(int i=0;i<iterations;i++){"
                                         "sum=((double)tanh(inc))*((double)sin(radi));"
                                         "inc+=step;"
                                         "radi+=rstep;"
                                         "if(inc>2.6&&step>0.0){step=(double)-0.01;}"        // reverse inc at the upper bound
                                         "if(inc<-2.6&&step<0.0){step=(double)0.01;}"        // reverse inc at the lower bound
                                         "if(radi>6.28&&rstep>0.0){rstep=(double)-0.00001;}" // reverse radi at the upper bound
                                         "if(radi<0.0&&rstep<0.0){rstep=(double)0.00001;}"   // reverse radi at the lower bound
                                         "}"
                                         "group_id[get_global_id(0)]=get_group_id(0);}";

I dropped to 10000 iterations. In general, don't overdo it: you may be getting cached responses, and the moment you change the calculation you'll hear your GPU complain.

So 10000 iterations :

2023.05.03 03:07:20.865 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(10000)
2023.05.03 03:07:20.865 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 03:07:20.865 blog_benchmark_2 (USDJPY,H1)    Milliseconds (31)
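For reference, the loop inside the kernel can be mirrored on the CPU to check that the two sweeps stay inside their bounds. This is a plain C++ sketch, not MQL5 or OpenCL C; the names LoopState and run_reference are made up for illustration, and it assumes that the radian bounds are meant to reverse rstep (the sweep of radi), the same way the other two checks reverse step.

```cpp
#include <cassert>
#include <cmath>

// The five private doubles the kernel keeps per work item.
struct LoopState { double sum, inc, radi, step, rstep; };

// CPU mirror of the kernel loop body.
// ASSUMPTION: the radian bound checks flip rstep, not step.
LoopState run_reference(int iterations) {
    assert(iterations >= 0);
    LoopState s{0.0, -2.6, 0.0, 0.01, 0.00001};
    for (int i = 0; i < iterations; ++i) {
        s.sum = std::tanh(s.inc) * std::sin(s.radi);
        s.inc += s.step;
        s.radi += s.rstep;
        if (s.inc > 2.6 && s.step > 0.0) s.step = -0.01;          // bounce inc down
        if (s.inc < -2.6 && s.step < 0.0) s.step = 0.01;          // bounce inc up
        if (s.radi > 6.28 && s.rstep > 0.0) s.rstep = -0.00001;   // bounce radi down
        if (s.radi < 0.0 && s.rstep < 0.0) s.rstep = 0.00001;     // bounce radi up
    }
    return s;
}
```

Calling run_reference(2000000) and inspecting inc and radi is a quick way to confirm that inc stays near [-2.6, 2.6] and radi near [0, 6.28] however long the loop runs.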

Let's increase to 100K.

I'll get 31 ms again, I'm sure, but this time I'll test it after a restart too.

So, 1 item and 100K iterations, without restarting the PC:

2023.05.03 03:11:24.106 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(100000)
2023.05.03 03:11:24.106 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 03:11:24.106 blog_benchmark_2 (USDJPY,H1)    Milliseconds (47)

And 1 item and 100K iterations, after restarting the PC:

2023.05.03 03:13:26.448 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(100000)
2023.05.03 03:13:26.448 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 03:13:26.448 blog_benchmark_2 (USDJPY,H1)    Milliseconds (47)

Same time, yay. Okay, we are good with this kernel, FINALLY. And look who decided to show up:

2023.05.03 03:13:26.448 blog_benchmark_2 (USDJPY,H1)    Kernel private mem (40)

The private memory measurement stopped being zero for the first time. Okay, so that's 40 what? 40 bytes per item?

Let's measure the kernel. We count the variables that are not global, not local and not arguments, so:

Yeah, 5 doubles, 5 by 8, it's 40 bytes. Okay, so that works too. Awesome.
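The count can be written down as a tiny calculation. This is a plain C++ sketch, not MQL5; the function name is hypothetical, and the only fact it relies on is that OpenCL C defines double as 8 bytes.

```cpp
#include <cassert>
#include <cstddef>

// OpenCL C fixes the size of double at 8 bytes.
constexpr std::size_t kOpenClDoubleBytes = 8;

// Bytes of private memory taken by n double variables declared in a kernel body.
std::size_t private_bytes_for_doubles(int n) {
    assert(n >= 0);
    return static_cast<std::size_t>(n) * kOpenClDoubleBytes;
}
```

For the five doubles in the kernel (sum, inc, step, radi, rstep), private_bytes_for_doubles(5) gives 40, which matches the reported value. The int loop counter does not seem to be included in that figure.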

Private memory is usually held in registers, which makes it super fast, so you don't need to worry about it. How we measure its limit is a question for later.

Alright, so can it take 1 million iterations now?

2023.05.03 03:18:41.921 blog_benchmark_2 (USDJPY,H1)    Work Items (1) Iterations(1000000)
2023.05.03 03:18:41.921 blog_benchmark_2 (USDJPY,H1)    Work Groups (1)
2023.05.03 03:18:41.921 blog_benchmark_2 (USDJPY,H1)    Milliseconds (203)

Okay, but is this number something that will allow us to spot the "fold"?

Let's see. Let's start increasing the items. I won't paste the logs, I'll just list the results:

items   groups   ms
1       1        203
100     1        203
256     1        188
512     2        297
1024    4        578
2048    8        1125
4096    1        2235

Hmm, what went down there? We went from 8 groups to 1 group and the screen flashed.

So is 1024 the maximum number of items per compute unit, and 256 the maximum group size? Would that indicate 1024 concurrent processes, as 4 groups of 256 that can each work in the same memory? (For 1D, assuming we do not do any splitting ourselves.)

We must remember that in this benchmark we are letting it decide on the splits, and it chose 1 group. Or it did not do anything in parallel here, and the file we output confirms this:

GROUP.ID[4095]=0

The last entry was 0, so it probably calculated nothing and glitched.

So, looking at the figures above, I don't think there is any concurrency between groups. If 4 groups (or 2, or 8) executed in parallel, we'd see more or less the same time as with 1 group, right? So what is running in parallel is maybe the 256 items.
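The figures above can be checked with a little arithmetic: if doubling the items roughly doubles the time, the extra groups are probably running one after another. The sketch below is plain C++, not MQL5; the Sample struct and function names are made up for illustration, the numbers are copied from the timings listed earlier, and the 4096 run is left out because it crashed.

```cpp
#include <cassert>
#include <vector>

struct Sample { int items; int groups; double ms; };

// Measurements copied from the timings listed earlier (group size chosen by the runtime).
inline std::vector<Sample> measured_samples() {
    return { {256, 1, 188.0}, {512, 2, 297.0}, {1024, 4, 578.0}, {2048, 8, 1125.0} };
}

// Ratio of run times when the item count doubles. A ratio near 1.0 would mean
// the extra groups ran concurrently; near 2.0 means they ran one after another.
inline double doubling_ratio(const Sample& smaller, const Sample& larger) {
    assert(larger.items == 2 * smaller.items && smaller.ms > 0.0);
    return larger.ms / smaller.ms;
}
```

With these numbers the ratio is about 1.58 from 256 to 512 items and about 1.95 for each doubling after that, which is consistent with the groups being serialized on this device.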

Hmm, so now let's introduce another parameter into our test and call it benchmark 3: the local dimensions parameter in the execution call.

bool  CLExecute(
   int          kernel,                   // Handle to the kernel of an OpenCL program
   uint         work_dim,                 // Dimension of the tasks space
   const uint&  global_work_offset[],     // Initial offset in the tasks space
   const uint&  global_work_size[],       // Total number of tasks
   const uint&  local_work_size[]         // Number of tasks in the local group
   );

What if I set this to 32? We get 8 subgroups.

Those 8 subgroups will execute at the same time, but if we deploy 10 subgroups (i.e. 2 groups), will we get a higher execution time?

I think I don't need to test this, but let's confirm.

It's my understanding that I have a device that can calculate 256 items at the same time, and this is why the integer CL_MAX_WORK_GROUP_SIZE returns this number too: the group size can be 256 because 256 items that run at the same time can share the local memory.

Could it be that simple?

Let's test how the groups split if I set the local_work_size parameter to 32 (matching the warp/wavefront) for different numbers of items:

    uint work_items_per_group[]={32};
    long msStarted=GetTickCount();
    CLExecute(ker,1,offsets,works,work_items_per_group);

I don't think I need to change anything else.

Let's see the number of groups it creates.

(Sidenote: the 4096 run did indeed crash, as I got a context error on the next run and needed a restart.)

items   groups   ms
1       none (error, okay, of course)
32      1        203
64      2        203
128     4        203
256     8        203
512     16       297
1024    32       578
2048    64       1125
4096    1        2234
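The group counts in this list follow from a simple rule: with an explicit local size, the number of groups is the item count divided by the local size, and in classic OpenCL 1.x the global size has to be a multiple of the local size, which is why 1 item with a local size of 32 is rejected. A plain C++ sketch of that rule follows (not MQL5; group_count is a hypothetical helper, and -1 is just this sketch's way of flagging an invalid launch):

```cpp
#include <cassert>

// Number of work-groups for a 1D launch, or -1 when the launch is invalid.
// Classic OpenCL 1.x rule: the global size must be a multiple of the local size.
int group_count(int global_items, int local_items) {
    assert(global_items > 0 && local_items > 0);
    if (global_items % local_items != 0) return -1; // e.g. 1 item with local size 32
    return global_items / local_items;
}
```

For example, group_count(256, 32) gives 8 and group_count(2048, 32) gives 64, matching the list. The 4096 row does not follow the rule, but that run crashed.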

Okay, so the value of

CLGetInfoInteger(ctx,CL_DEVICE_MAX_WORK_GROUP_SIZE);

which returns 1024 on my device, seems to refer to the maximum number of items that can run in parallel at the same time. (Per the OpenCL specification, it is the maximum number of work items allowed in a single work group.)

That means that if you set the local size to 1024 it won't run 1024 items in parallel; it will split the work into packs of 1024, and it will likely fail or slow down. But if you set it to 1, it will treat each work item as its own group, right?

So for this test, which does not need groups, if I set the local size to 1, do I get the fastest execution when the items are dead on 1024?

Nope, I get a glitch again. What if the total number of items is 256?

Nope, glitch... Erm, multiples of 2 perhaps?

Nnnope, glitch... So what on earth is going on again?

What if I have 8 items and a local size of 1, so 1 item per group, 8 groups essentially?

No glitch. So there is a limit on the number of groups, but it has nothing to do with this:

CL_DEVICE_MAX_WORK_GROUP_SIZE

And the way to discover the group limit is by looking for the drop from the incrementing number of groups to 1, followed by the screen glitching???

But it also tolerated 2048 work items, so I think we are seeing a lack of communication between OpenCL and the hardware.

Is it 8 per dimension though, so 8^3 = 512? Or the cube root of 1024 (the max items?), hmm,

that's ~10, so 10 groups max? (Per dimension, I assume.)

How would we know? 320 items with a local size of 32 would probably work.

It did, okay, but 64 groups worked too (2048 items with a local size of 32)... Anyway, I'm demolished. I'll pick it up another time.

I'm attaching the benchmarks...

Also, there is something to take away from all these tests: the value of

CLGetInfoInteger(ker,CL_KERNEL_WORK_GROUP_SIZE);

seems to be telling you how many kernel instances can run in parallel on this device (or per compute unit, but that can be tested with the CPU).