OpenCL: internal implementation tests in MQL5 - page 33

 
Mathemat:

Andrew, is, say, an Intel + Radeon combination really such a bad thing?

Not bad, just unreasonably expensive (because of the processor). :)

By the way, I'm a longtime fan of nVidia cards. I've even got a box somewhere with the legendary GeForce 3. And if I wanted to play games I would not hesitate to stick with the "green" graphics chip manufacturer.

 
I'll send you the post in a private message. I don't want to bring it up here.
 
MetaDriver:
On a serious note, I am awfully curious what juices you will manage to squeeze out of it, especially if you have 2 GB of GDDR5. As it turns out, onboard GPU memory can be a VERY serious resource for OpenCL computation.

From all the information available to me, I have concluded that the main resource is the number of GPU cores. If there are not enough of them, the task is split into consecutive runs of the cores with new threads. It is hard to save on this resource when buying a card, since the more cores, the higher the price.

The second most important resource is the speed of the GPU memory (since memory is accessed quite frequently). GPU tasks are in most cases quite primitive: 1-3 operations before accessing memory to write out results. Complex logical operations are contraindicated for a GPU, so programmers strive to minimize them, which logically leads to more frequent memory accesses. There are variants here: if the programmer formulates the task so that memory accesses are as rare as possible, this resource becomes less important.

And the third resource I would call the amount of GPU memory. Crash tests have shown that, regardless of the number of concurrent contexts, all memory distributed among the contexts is allocated from a single memory pool and does not overlap. Let me explain with an example: if you have contexts in each of which buffers occupy 1/4 of the device memory, then you can have 4 such contexts at the same time. A fifth context, even if you create it, will not be allocated memory, since it has already been claimed by the previous contexts. But freeing memory in any of the previous ones (simply deleting a buffer) frees some space, and the fifth context will work fine.
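The single-pool behaviour described above can be sketched as a toy model. This is purely an illustration of the allocation arithmetic, not the behaviour of any real driver; all numbers are invented:

```python
# Toy model of the single-pool allocation behaviour: all contexts draw
# their buffers from one shared pool of device memory. Numbers are
# illustrative only.

class Device:
    def __init__(self, total_mem):
        self.total = total_mem
        self.used = 0

    def alloc_buffer(self, size):
        """Return True if the buffer fits in the remaining pool."""
        if self.used + size > self.total:
            return False
        self.used += size
        return True

    def free_buffer(self, size):
        self.used -= size

dev = Device(total_mem=1024)          # hypothetical 1024 MB card
quarter = dev.total // 4              # each context takes 1/4

# Four contexts succeed, the fifth is refused...
results = [dev.alloc_buffer(quarter) for _ in range(5)]
print(results)                        # [True, True, True, True, False]

# ...until one of the earlier contexts frees its buffer.
dev.free_buffer(quarter)
print(dev.alloc_buffer(quarter))      # True
```

The point of the sketch: the refusal in the fifth context happens not because the context is invalid, but because the shared pool is exhausted.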

Renat:

It's early days yet: we need to make sure that OpenCL programs do not hang the whole network, whether due to GPU glitches or the OpenCL programs themselves.

As a matter of fact, OpenCL programs can only be admitted to the network after test runs on local agents confirm that the program is functional and does not kill the computer.

The task is a distributed parallel-computing network; the name alone may confuse an untrained reader. If you had problems organizing a distributed network on multicore machines, now you will have those problems squared. All cores can be treated as separate network units, since they perform separate tasks. But previously their execution speeds differed by at most 2-3 times (which is why you introduced speed limits for slow cores), and the amount of memory in most cases played no role, since arrays of at most 10^7 elements are pennies for modern machines.

But with a GPU the problem changes dramatically. First of all, just ~12 double arrays of length 10^7 already take about 1 GB, which is the limit for many cards. In CPU computations, tasks with more buffers are quite common (a GPU programmer can, of course, fall back on host memory, but that is akin to virtual RAM; in short, a bad idea).
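A quick back-of-envelope check of that figure (assuming 8-byte doubles):

```python
# How much device memory do 12 double-precision arrays of 10^7
# elements each occupy?

SIZEOF_DOUBLE = 8            # bytes per double
n_arrays = 12
n_elements = 10**7

total_bytes = n_arrays * n_elements * SIZEOF_DOUBLE
print(total_bytes)                    # 960000000
print(total_bytes / 2**30)            # ~0.894 GiB, i.e. roughly 1 GB
```

So 12 such buffers come to 960 MB, which indeed brushes up against the 1 GB ceiling of many cards of that era.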

Secondly, execution speed is linearly proportional to the number of cores in a GPU, so the difference between two cards can be 10-1000 times.

In general, the networking task comes down to classifying the program to be executed. Take a look at the CUDA profiler: its statistics could serve as a basis for task classification. If a task is structured so that most of the time is spent on memory access, it needs a cluster of machines with large memory; if most of the time is spent on arithmetic, it needs a cluster with many cores. Clusters can be flexible or pluggable (a matter of implementation).
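The classification idea above could be sketched roughly as follows. The thresholds, category names, and input format are all invented here for illustration; a real scheduler would work from actual profiler counters:

```python
# Hypothetical sketch: given profiler statistics (fractions of kernel
# time spent on memory access vs arithmetic), decide which cluster a
# task should be routed to. Thresholds and names are invented.

def classify_task(mem_access_frac, arithmetic_frac):
    """Route a task by where its kernel time is dominantly spent."""
    if mem_access_frac >= 0.5:
        return "memory-cluster"      # agents with large GPU memory
    if arithmetic_frac >= 0.5:
        return "compute-cluster"     # agents with many GPU cores
    return "general-cluster"         # no dominant cost, any agent

print(classify_task(0.7, 0.2))        # memory-cluster
print(classify_task(0.1, 0.8))        # compute-cluster
```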

Although the task is simplified a bit by the unification imposed by time itself: a card with 12 cores is likely to have 256 MB, a card with 96 cores 512 MB. On average, manufacturers do not allow large imbalances (unlike CPUs, where a user can cram an old chip with RAM to the brim, or put minimal RAM on a new chip just to save money).

Although, in my opinion, a more correct approach would be to create a debugger for OpenCL and, on its basis, tune device-specific optimization at the bytecode level. Otherwise we arrive at assembler, where the programmer has to guess which card the program will run on and average the program's characteristics over the possible environments.

 
MetaDriver:

Tell me, if you don't mind, how do you run the test? Where, and what needs changing? I copy it, select it, and the result is:

Win7 x64 build 607

 
WChas:

This example does not need to be "run" in the tester. To run the script, drag it from the "Navigator" window onto a chart. The result will be displayed in the "Tools" panel, on the "Experts" tab.

 

w7 32 bit 4GB ( 3.5GB available)

Intel Core 2 Quad Q9505 Yorkfield (2833MHz, LGA775, L2 6144Kb, 1333MHz) vs Radeon HD 5770

 
Snaf:

w7 32 bit 4GB ( 3.5GB available)

Intel Core 2 Quad Q9505 Yorkfield (2833MHz, LGA775, L2 6144Kb, 1333MHz) vs Radeon HD 5770

Cool, now you know where to dig... :)
 
MetaDriver:
Cool, now you know where to dig... :)

The processors are already 2-3 generations behind,

and the video cards too: 5770 - 6770 - 7770

:)

 
Urain:

From the information available to me, I came to the conclusion that the main resource is the number of GPU cores. If there are not enough of them, the task is split into consecutive runs of the cores with new threads. It is hard to save on this resource when buying a card, since the more cores, the higher the price.

The second most important resource is the speed of the GPU memory (since memory is accessed quite frequently). GPU tasks are in most cases quite primitive: 1-3 operations before accessing memory to write out results. Complex logical operations are contraindicated for a GPU, so programmers strive to minimize them, which logically leads to more frequent memory accesses. There are variants here: if the programmer formulates the task so that memory accesses are as rare as possible, this resource becomes less important.

And the third resource I would call the amount of GPU memory. Crash tests have shown that, regardless of the number of concurrent contexts, all memory distributed among the contexts is allocated from a single memory pool and does not overlap. Let me explain with an example: if you have contexts in each of which buffers occupy 1/4 of the device memory, then you can have 4 such contexts at the same time. A fifth context, even if you create it, will not be allocated memory, since it has already been claimed by the previous contexts. Although by freeing memory in any of the previous ones (simply deleting a buffer), some space will appear and the fifth context will work fine.

Nikolai, I agree with you about the individual hierarchy of values. But concerning the cloud... the problem is memory. If the first two resources are insufficient on a cloud machine, the client program will just slow down. If GPU memory is insufficient, it can simply crash. If the driver fails to allocate a buffer, that is only half the trouble. The real misfortune is when there is formally enough memory, but not enough is left for the remaining GPU contexts (including the system ones). Then the driver simply crashes (as practice has shown). Perhaps this is just a flaw in the driver software, but as long as it exists, it would be better not to let OpenCL programs into the cloud. Remote agents are fine, but not the cloud.
 

After upgrading to build 607, I suddenly got opencltest working on my laptop (https://www.mql5.com/ru/code/825); it didn't work before (about two weeks ago), I think it said "OpenCL not found".

"I smell a trick." I haven't played around with Mandelbrot fractals yet ))))), but it's still nice that even a far-from-new laptop can come in handy for full MT5 testing.

OpenCL Test
  • votes: 10
  • 2012.02.07
  • MetaQuotes Software
  • www.mql5.com
A small working example of calculating the Mandelbrot fractal in OpenCL, which radically speeds up the computation, by roughly 100 times compared to the software implementation.