OpenCL: real challenges - page 8

 
Mathemat:

Look at your own code: in the last line you yourself divide 240 by 18 (the number of units on your card).

You're obviously confused about something. Here's the piece in question:

   double price[30];

   uint units = (uint)CLGetInfoInteger(hcontext, CL_DEVICE_MAX_COMPUTE_UNITS);
   uint global_work_offset[] = {0};
   uint global_work_size[1];
   uint local_work_size[1];
   global_work_size[0] = ArraySize(price); /// <-- note: this is NOT sizeof(price)!
   local_work_size[0] = global_work_size[0] / units;

   Print("global=",global_work_size[0]," local=",local_work_size[0]);

Output: global=30 local=1

The 240 bytes only appear when the buffer is created.

 
Roffild:

You're obviously confused about something. Here's the piece in question:

Output: global=30 local=1

The 240 bytes only appear when the buffer is created.

global = 240. Print out global_work_size[0].

And local_work_size[0] = (uint) 240/18 = 13

P.S. Yes, you're right. Sorry, I got a bit confused.

local_work_size[0] = (uint) 30/18 = 1. I get the same, since units=28 for me (30/28 = 1).
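To make the arithmetic in this exchange concrete, here is a minimal sketch in plain C (not MQL5). The names `local_size_per_unit` and `ELEMENT_COUNT` are made up for illustration, and the 240-byte figure assumes an 8-byte double, as in MQL5.

```c
#include <assert.h>
#include <stddef.h>

/* Element count of a fixed-size array: the C analogue of ArraySize().
   sizeof(array) alone gives the size in BYTES, which is where 240 came from. */
#define ELEMENT_COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Naive split used in the snippet: work items per compute unit.
   Unsigned integer division truncates, like the (uint) result in MQL5. */
unsigned local_size_per_unit(unsigned global_size, unsigned units)
{
    assert(units > 0);   /* hypothetical guard: a device reports at least one unit */
    return global_size / units;
}
```

For `double price[30]`, `sizeof(price)` is 240 but `ELEMENT_COUNT(price)` is 30. Then 30 / 18 and 30 / 28 both truncate to 1, while the mistaken 240 / 18 would give 13.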

 

Again, Roffild:

Mathemat: Let's do a rough estimate. 18 tasks running simultaneously on the GPU's "flies" is at most what 4-5 CPU threads can do. And a CPU under x86 emulation can organize far more threads than that, at least if it's an Intel. My old Pentium G840 (2 cores) gave a speedup of about 70x, on two units! Not to mention what my current, loosely speaking, i7 gets up to.

A well-parallelized task (see MetaDriver's scripts from the first OpenCL thread) can reach speedups of 1000x or more on a GPU (compared with single-threaded execution on the CPU in MQL5). If you can't find them, I can send them to you so you can test them on your card.

 

Have you figured out the buffer and its speed?

You'd do better to use AMD CodeXL to sort out the units and so on; it has nice performance graphs.

AMD CodeXL itself is glitchy, but it's hard to draw any conclusions without it.

I'm not going to use OpenCL until the tester lets me use the CPU, or until I have a task that runs longer than number_of_buffers * 0.353 ms.
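That rule of thumb can be written down as a rough sketch in plain C. The 0.353 ms figure is the one quoted in the post and is treated here as a per-buffer cost, which is how the post uses it. The function names are hypothetical, and a real decision would need measurements on the actual card.

```c
#include <assert.h>

/* Per-buffer cost in milliseconds, as quoted in the post (not re-measured). */
#define BUFFER_COST_MS 0.353

/* Fixed overhead of handling `buffers` buffers, independent of kernel work. */
double transfer_overhead_ms(int buffers)
{
    assert(buffers >= 0);
    return buffers * BUFFER_COST_MS;
}

/* Returns 1 when the task itself runs longer than the buffer overhead,
   i.e. when offloading has a chance to pay off; 0 otherwise. */
int worth_offloading(double task_ms, int buffers)
{
    return task_ms > transfer_overhead_ms(buffers);
}
```

With three buffers the overhead is about 1.06 ms, so a 1 ms task fails the check and a 2 ms task passes it.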

P.S.

I did finish optimizing my code, and the final version passes the test in 33 seconds (320 seconds before optimization, 55 seconds "OpenCL-style").

 
Roffild: Have you figured out the buffer and its speed?

There's nothing to figure out. It's clearly a slow operation. Conclusion: do more work inside the kernel (your code does too little).

And buy a more modern video card; things seem to have improved on that front.

Roffild: AMD CodeXL itself is glitchy, but it's hard to draw any conclusions without it.

Intel's utility is quite useful too, but only for Intel processors. It's also good for catching the most obvious errors in the kernel.

Roffild: P.S. I did finish optimizing my code, and the final version passes the test in 33 seconds (320 seconds before optimization, 55 seconds "OpenCL-style").

It's already much better.

OpenCL: From Naive Towards More Insightful Programming
  • 2012.06.05
  • Sceptic Philozoff
  • www.mql5.com
This article demonstrates some of the optimization opportunities that open up when you take at least superficial account of the hardware the kernel runs on. The figures obtained are far from the limit, but even they show that with the capabilities available here and now (the OpenCL API as implemented by the terminal developers does not allow control over some parameters important for optimization, in particular the local work group size), the performance gain compared with running the host program is very substantial.
 

Today I needed to generate an array holding the number of 1 bits in each number.

At the same time I practiced with OpenCL.
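For comparison, the same per-element computation can be sketched on the CPU in plain C. This mirrors the loop in the kernel posted below; the function names `count_set_bits` and `fill_bit_counts` are made up for illustration and are not part of the original script.

```c
#include <assert.h>
#include <stddef.h>

/* Count the 1 bits of `value`, testing bit positions bits..0
   in the same order as the kernel loop. */
unsigned char count_set_bits(size_t value, int bits)
{
    unsigned char count = 0;
    /* the shift below is only defined for positions inside size_t */
    assert(bits >= 0 && bits < (int)(sizeof(size_t) * 8));
    for (int bit = bits; bit > -1; bit--)
        if ((value & ((size_t)1 << bit)) != 0)
            count++;
    return count;
}

/* Fill result[i] with the number of 1 bits in i, for i in [0, n). */
void fill_bit_counts(unsigned char *result, size_t n, int bits)
{
    for (size_t i = 0; i < n; i++)
        result[i] = count_set_bits(i, bits);
}
```

A table built this way could be compared element by element with the buffer read back from the device to check the kernel.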

I'm posting the code as a demonstration of an interesting method to calculate global_work_size and local_work_size. The idea itself is taken from IntrotoOpenCL.pdf (I have a copy), but I tweaked it.

void OnStart()
{
   const string source =
      "kernel void differ(const int sizearray, const int bits, global uchar *result)        \r\n"
      "{                                                                                    \r\n"
      "   size_t global_index = get_global_id(0);                                           \r\n"
      "   if (global_index >= sizearray) return; // bounds check when work > arraysize      \r\n"
      "   size_t xor = global_index;                                                        \r\n"
      "   uchar secc = 0;                                                                   \r\n"
      "   for (int bit = bits; bit>-1; bit--)                                               \r\n"
      "     if ((xor & ((size_t)1 << bit)) > 0) secc++;                                     \r\n"
      "   result[global_index] = secc;                                                      \r\n"
      "}                                                                                    \r\n"
   ;
   
   int hContext = CLContextCreate();
   string build_log = "";
   int hProgram = CLProgramCreate(hContext, source, build_log);
   Print("Error = ",build_log);
   int hKernel = CLKernelCreate(hProgram, "differ");
   
   uchar alldiff[1 << 17] = {0};
   CLSetKernelArg(hKernel, 0, ArraySize(alldiff));
   CLSetKernelArg(hKernel, 1, 17 /*bits*/);
   int hBuffer = CLBufferCreate(hContext, sizeof(alldiff), CL_MEM_WRITE_ONLY);
   CLSetKernelArgMem(hKernel, 2, hBuffer);
   CLBufferWrite(hBuffer, alldiff);
   
   /*uchar group_size[1024] = {0};
   uint deviceinfo_size = 0;
   CLGetDeviceInfo(hContext, CL_DEVICE_MAX_WORK_GROUP_SIZE, group_size, deviceinfo_size);
      for (int x = deviceinfo_size; x>=0; x--) Print(group_size[x]);
      Print("ch ",CharArrayToString(group_size));
   */ ///// CLGetDeviceInfo returns a raw byte array (what is this nonsense?)
   uint group_size = 256;
   
   uint units = (uint)CLGetInfoInteger(hContext, CL_DEVICE_MAX_COMPUTE_UNITS);
   uint global_work_offset[] = {0};
   uint global_work_size[1];
   uint local_work_size[1];
   global_work_size[0] = ArraySize(alldiff);
   local_work_size[0] = global_work_size[0] / units;
   if (local_work_size[0] < 1) local_work_size[0] = 1;
   if (local_work_size[0] > group_size) local_work_size[0] = group_size;
   if (global_work_size[0] % local_work_size[0] != 0)
   {
      // increase global so that global % local == 0
      // the kernel itself checks for out-of-bounds indices
      global_work_size[0] = (int(global_work_size[0] / local_work_size[0]) +1) * local_work_size[0];
      // explained in
      // http://wiki.rac.manchester.ac.uk/community/OpenCL?action=AttachFile&do=get&target=IntrotoOpenCL.pdf
   }
   Print("work=", global_work_size[0], " local=", local_work_size[0], " group=", group_size);
   bool exec = CLExecute(hKernel, 1, global_work_offset, global_work_size, local_work_size); // async
   if (exec == false) Print("Error in ",__FUNCSIG__," CLExecute: ",GetLastError());

   CLBufferRead(hBuffer, alldiff);
   
   int hDump = FileOpen("alldiff.diff", FILE_ANSI|FILE_WRITE);
   for (int x = 0, xcount = ArraySize(alldiff); x < xcount; x++)
      FileWriteString(hDump, (string)alldiff[x]+",");
   FileClose(hDump);
   
   CLBufferFree(hBuffer);
   CLKernelFree(hKernel);
   CLProgramFree(hProgram);
   CLContextFree(hContext);
}
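Stripped of the OpenCL calls, the work-size calculation from the script comes down to a few lines. Here is a minimal sketch in plain C. The struct `WorkSizes` and the function `compute_work_sizes` are hypothetical names, and `max_group` stands in for the hard-coded group_size = 256.

```c
#include <assert.h>

/* Result of the work-size calculation. */
typedef struct {
    unsigned global_size;  /* padded so that global_size % local_size == 0 */
    unsigned local_size;   /* clamped to the range [1, max_group] */
} WorkSizes;

WorkSizes compute_work_sizes(unsigned items, unsigned units, unsigned max_group)
{
    WorkSizes ws;
    assert(units > 0 && max_group > 0);

    /* start from an even split of the items across compute units */
    ws.local_size = items / units;
    if (ws.local_size < 1) ws.local_size = 1;
    if (ws.local_size > max_group) ws.local_size = max_group;

    /* round global up to the next multiple of local; the kernel's
       bounds check discards the extra work items */
    ws.global_size = items;
    if (ws.global_size % ws.local_size != 0)
        ws.global_size = (ws.global_size / ws.local_size + 1) * ws.local_size;
    return ws;
}
```

For the 131072-element array with 18 units and a group limit of 256, this gives local = 256 and global = 131072 with no padding. For 1000 items on 3 units, local is again clamped to 256 and global is padded to 1024, which is exactly the case the bounds check in the kernel is there for.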