OpenCL: internal implementation tests in MQL5 - page 3

 
Renat:
There are clarifications:
...
2) We support OpenCL 1.1 and above because it supports the double type. OpenCL 1.0 can only operate with float, whose accuracy is not at all suitable for financial calculations.

Try installing newer drivers, although many cards from previous generations do not support double operations.
I hope float is still supported in 1.1 as well? Float is not enough in the general case, but in many particular cases it is quite enough. And extra memory is often expensive, especially in parallel computations.
 
Graff:
While testing JavaDev's scripts this summer, we ran into a problem that my graphics card didn't support double, but it worked with float. Drivers can't fix it, we need to change the card :(
All processor cores will be used if the graphics card is not supported.
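For anyone who wants to check in advance whether a particular card is in the same situation, here is a minimal host-side sketch in plain C against the standard OpenCL API (not MQL5 code, just an illustration): it asks the first GPU device whether it advertises the cl_khr_fp64 extension, i.e. whether double maths is available at all.

#define CL_TARGET_OPENCL_VERSION 110   /* silence header warnings on newer SDKs */
#include <stdio.h>
#include <string.h>
#include <CL/cl.h>

int main(void)
{
   cl_platform_id platform;
   cl_device_id   device;
   char           extensions[4096] = "";

   if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS)
     { printf("no OpenCL platform found\n"); return 1; }
   if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS)
     { printf("no GPU device on this platform\n"); return 1; }

   /* the extension string lists everything the device supports */
   clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(extensions), extensions, NULL);
   printf("double supported: %s\n",
          strstr(extensions, "cl_khr_fp64") ? "yes" : "no (float only)");
   return 0;
}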
 

Installed Catalyst Control Center 12, previously it was 11. Already got results (highlighted in red):

Just "played around" with drop-down list "0" to "1" and the ATI icon appeared. It didn't disappear again at this start of the program. A small glitch in the software seems to be....

 
WChas:
Cool! I have an MSI R6970 - 1536 threads (agents) - and a Gigabyte HD5870 (1600 processors). In BOINC Manager they can be used without combining them in CrossFire (just plug in one output of the second card, or connect one of its outputs to a second monitor). Question: will it be possible to use them both without CrossFire?

Not sure yet. It depends more on the program itself, which selects the devices; that means using each card explicitly in the code.

We'll see later when we release the first beta.

 
MetaDriver:
I hope float is still supported in 1.1 as well? Float is not enough in the general case, but in many particular cases it is quite enough. And extra memory is often expensive, especially in parallel computations.
In OpenCL 1.1, float is of course supported.
 
WChas:
If I understand correctly, 1 GPU is one very powerful agent? In that case, could CPU agents be disabled (given their low speed relative to the GPU)?

The CPU core cannot be disabled in any way; it is used as the host platform in the MQL5 environment anyway.

It is important to understand that GPU cores are "a swarm of highly specialized bees" as compared to "workhorses" of CPU cores. GPU cores can by no means serve as a substitute for cores of a conventional CPU.

If an EA developer is able to parallelize the task into hundreds and thousands of independent threads, the GPU will give a 10-100x speedup. But in most individual tasks (trading and indicators, for example) the calculations are sequential and there is no chance to parallelize them effectively. Also, using double-precision real maths on regular GPUs results in speeds about half of the peak values achieved on float. Here is an interesting link with a discussion of the speed drop on double: http://www.gpgpu.ru/node/901
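To make the "hundreds and thousands of independent threads" point concrete, here is a sketch of the kind of kernel that parallelizes well (a hypothetical OpenCL C example, not code from the tester): every work-item produces one simple-moving-average value, so thousands of bars are processed in parallel instead of in one sequential loop.

/* one work-item = one bar; the bars are completely independent of each other */
__kernel void sma(__global const float *price,   /* input price series    */
                  __global float       *out,     /* one SMA value per bar */
                  const int             period,
                  const int             bars)
{
   int i = get_global_id(0);                     /* which bar this thread handles */
   if (i < period - 1 || i >= bars)
      return;

   float sum = 0.0f;
   for (int k = 0; k < period; k++)              /* the inner loop is sequential,  */
      sum += price[i - k];                       /* but every bar runs in parallel */
   out[i] = sum / (float)period;
}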

We will buy cards of different classes ourselves and publish a small study of calculations for some tasks in double and float. We are interested ourselves in what we can get.

Here is a small study of calculation speeds with float and double on different devices (full report: http://agora.guru.ru/hpc-h/files/017_Krivov_NvidiaGpuComparision.pdf). You can see that double speed on Tesla is 2 times lower than peak float speed, while on consumer cards like the GeForce GTX 480 it is almost 10 times slower. In practice this means it is possible to get even better results in double maths on a regular 4-8 core CPU with active use of SSE2-4 and AVX:


We were able to get hundredfold speedups (in the limit, 100-1000 times and more) on sequential runs in the trading strategy optimizer, and tenfold speedups (in the limit, 10-20-50-100-200 times) in genetics, thanks to the inherently parallel idea of independent test runs and the MQL5 Cloud Network. But when it comes to parallelism within a single task, all the effort falls on the shoulders of the GPU programmer, who has to intelligently distribute the task among hundreds or thousands of independent simple cores.


There is another nuance: most likely, when several agents are used on one computer, only one agent will be granted the right to use the GPU. Or, if there are several real physical GPU devices, they will be distributed between the agents at agent startup.

The point is that it is unwise to share one physical device between multiple agents, as such sharing causes a non-linear drop in final performance. In other words, 4 agents on one GPU perform many times worse than 1 agent on one GPU. Our internal tests have shown this.

We will perform more detailed tests and come up with a solution that maximizes the result.

 
 
 
Thanks for the clarification.
Renat:

........

We were able to get hundredfold speedups (in the limit, 100-1000 times and more) on sequential runs in the trading strategy optimizer, and tenfold speedups (in the limit, 10-20-50-100-200 times) in genetics, thanks to the inherently parallel idea of independent test runs and the MQL5 Cloud Network. But when it comes to parallelism within a single task, all the effort falls on the shoulders of the GPU programmer, who has to intelligently distribute the task among hundreds or thousands of independent simple cores.

........

For me personally, it's the acceleration in the optimizer that's important. So I'm looking forward to the necessary update + performance table of different video cards.
 

Re-posted from the OpenCL thread at MQL4.com:

https://www.mql5.com/ru/forum/137422/page6

It's not that simple.

Besides, Renat is confusing people: OpenCL 1.0 can work fine with double; it is an optional extension supported by all manufacturers - BUT STILL NOT ON ALL OLD CARDS.

http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/

"Optional Double Precision and Half Floating Point

OpenCL 1.0 adds support for double precision and half floating-point as optional extensions. The double data type must confirm to the IEEE-754 double precision storage format.

An application that wants to use double will need to include the #pragma OPENCL EXTENSION cl_khr_fp64 : enable directive before any double precision data type is declared in the kernel code. This will extend the list of built-in vector and scalar data types to include the following:.....".
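In practice the directive simply sits at the top of the kernel source. A minimal sketch of such a kernel, assuming the device really does export cl_khr_fp64 (on devices that don't, compilation of this source fails):

#pragma OPENCL EXTENSION cl_khr_fp64 : enable    /* turn on the optional double support */

__kernel void scale(__global double *data, const double factor)
{
   int i = get_global_id(0);
   data[i] *= factor;    /* plain double arithmetic, one element per work-item */
}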

It can work in the Strategy Tester, but it won't work in an OpenCL 1.0 Expert Advisor - not because, as Renat says, "there is no double there", but, as I've already mentioned, because there is no safe threading in OpenCL 1.0 and there is no threading in MT4-5.

There is a lot of confusion around OpenCL (and CUDA) in general. What do you expect? After all, hardware and radio engineers set out to change the concept of programming. They have a hardware mindset.

There will also be a problem with so-called PLATFORM selection: the program - that is, MT, a DLL or an Expert Advisor - MUST, simply MUST, manually select the platform (AMD, Nvidia, Intel) that will run the OpenCL kernel - there can be several different ones on one computer - and then manually select the DEVICE if the computer has multiple GPUs. There is no automatic platform selection in OpenCL yet. Renat talks about "auto-selecting the most powerful one", but I don't know how that would work. In the example shown there, there is no platform selection and no device selection (GPU, CPU).
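For reference, the manual selection looks roughly like this in plain C against the standard API (a sketch under the assumption that you simply want to list every platform that exposes at least one GPU and then pick from them yourself):

#define CL_TARGET_OPENCL_VERSION 110
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
   cl_platform_id platforms[8];
   cl_uint        nplat = 0;
   clGetPlatformIDs(8, platforms, &nplat);       /* AMD, Nvidia, Intel... whatever is installed */

   for (cl_uint p = 0; p < nplat; p++)
   {
      char name[256] = "";
      clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(name), name, NULL);

      cl_device_id devices[8];
      cl_uint      ndev = 0;
      if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 8, devices, &ndev) != CL_SUCCESS)
         continue;                               /* this platform has no GPU devices */

      printf("platform '%s' exposes %u GPU device(s)\n", name, ndev);
      /* it is the program, not the driver, that now has to choose one of
         devices[0..ndev-1] and build its context and command queue on it */
   }
   return 0;
}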

Furthermore, the standard does not yet provide automatic OpenCL parallelization of a task across several GPUs or across GPU+CPU. Let us put it this way: in some versions of its drivers/SDK, AMD introduced such automatic parallelization, but it had problems and for the time being AMD has switched this feature off.

Bottom line: developing and enabling OpenCL programs for MT4-5 requires some manual effort and therefore will not work automatically or by "recompiling with an option". Which in turn is fraught with a lot of stalls in real-world operation. It will be fine work and, importantly - I allow myself to repeat - it is unfortunately HARDWARE-ORIENTED PROGRAMMING, which is wrong. Debugging parallel programs for CUDA or OpenCL turned out to be much more difficult than the hardware revolutionaries assumed. Nvidia even cancelled their fall 2011 CUDA conference because of driver issues and a lot of complaints about stalled debugging. So they added another 1000 new features to the latest Toolkit - but what good is that if the simplest programs don't even run, or run with interruptions? After all, they haven't even described half of the internal mechanics of OpenCL or CUDA in their documentation.

The GPU speed (in gigaflops) of a video card that hangs because of driver or software incompatibility is zero.
