OpenCL in trading

 

OpenCL is a framework that provides an open standard for writing programs that can run on different types of hardware platforms, such as CPUs, GPUs, and specialized processing units. It allows software developers to write code in a single language that can be executed on multiple devices, regardless of their vendor or architecture.

OpenCL comprises a runtime and programming interface that offer a level of platform independence, allowing developers to write code that can be executed on any OpenCL-enabled device. Moreover, it provides a set of low-level APIs that enable developers to control the device, memory, and kernel execution explicitly, giving them fine-grained control over their applications.

OpenCL has extensive applications in scientific computing, image and video processing, machine learning, and other domains. It enhances the performance of applications by utilizing the parallel computing power of multiple devices, enabling faster and more efficient execution.

One of the most significant advantages of OpenCL is its ability to utilize the computing power of GPUs, which can perform specific types of calculations much faster than CPUs. This makes it particularly useful for applications that involve heavy calculations, such as scientific simulations, image and video processing, and machine learning.

Overall, OpenCL provides a flexible framework for developing applications that can leverage the power of different types of computing devices, making it a valuable tool for developers working on high-performance computing applications.


MQL5 has supported OpenCL since 2012; for details, see the chapter Working with OpenCL of the MQL5 Reference. See also Class for Working with OpenCL programs.

Examples of OpenCL usage can be found in MQL5\Scripts\Examples\OpenCL.

OpenCL examples in MetaTrader 5

Here is the Seascape OpenCL example:




See also articles:

Documentation on MQL5: Working with OpenCL
  • www.mql5.com
Working with OpenCL - MQL5 Reference - Reference on algorithmic/automated trading language for MetaTrader 5
 

Introduction to OpenCL



Introduction to OpenCL (1)

The Introduction to OpenCL video discusses OpenCL as a low-level language for high-performance heterogeneous data-parallel computation, supporting multiple types of devices, including CPUs, GPUs, and FPGAs. OpenCL became an open standard in 2008 and has received significant industry support from companies such as Intel, Nvidia, and AMD. While OpenCL is often compared to CUDA, which has better tools, features, and support from Nvidia, OpenCL supports more devices, making it more widespread across manufacturers. For personal projects, the speaker suggests using CUDA for its better tools and optimization, while OpenCL is recommended for professional products that need to support different GPUs.

  • 00:00:00 In this section, the speaker introduces OpenCL as a low-level language for high-performance heterogeneous data-parallel computation. OpenCL can support multiple types of devices, including CPUs, GPUs, and FPGAs, and is based on C99, allowing for portability across devices. OpenCL also provides a consistent way to express vectors and has shared math libraries and an OpenCL certification process that ensures guaranteed precision. The speaker notes that OpenCL became an open standard in 2008, receiving significant industry support from companies such as Intel, Nvidia, and AMD, as well as embedded device makers like Ericsson, Nokia, and Texas Instruments. While OpenCL is often compared to CUDA, which has better tools, features, and support from Nvidia, OpenCL supports more devices, making it more widespread across manufacturers.

  • 00:05:00 In this section, the speaker discusses the differences between CUDA and OpenCL and when to choose one over the other for different purposes. For personal projects, the speaker suggests using CUDA for its better tools, debuggers, and optimizations. However, for professional products that need to support different GPUs, the speaker recommends using OpenCL as it is the only way to support non-Nvidia GPUs and is also evolving with the support of several companies. When it comes to the course, the speaker suggests using CUDA for the better tools and streamlined coding, but OpenCL may be easier to use to tap into all computing resources.
Introduction to OpenCL (1)
  • 2016.04.06
  • www.youtube.com
Introduction to OpenCL: What is it, what is it good for, how does it compare to CUDA.
 

What is OpenCL Good for?



What is OpenCL Good for? (2)

The speaker in the video talks about the advantages of using OpenCL for computationally intensive programs that are data parallel and single precision. GPUs are designed for graphics and are ideal due to the high proportion of math operations to memory operations. The speaker explains that higher-intensity loops spend more time doing math operations, where GPUs excel, while low-intensity loops spend most of their time waiting for memory access. Data parallelism, which involves performing the same independent operations on lots of data, is also explored in this section. The speaker also discusses the use of single and double precision in OpenCL, where double precision is more expensive to execute because it requires twice as much data as single precision.

  • 00:00:00 In this section, the speaker explains that OpenCL is good for computationally intensive programs that are data parallel and single precision. GPUs are designed for graphics and are good for these types of programs because they are computationally intensive, with the proportion of math operations to memory operations being high. Math is fast, and memory is slow, so having lots of math operations keeps the machine busy while memory accesses slow it down. The speaker explains that low intensity loops spend most of their time waiting for memory, whereas higher intensity loops spend more time doing math operations, which is where GPUs excel. Data parallelism, which means doing the same independent operations on lots of data, is also explored in this section. Examples include modifying pixels in an image or updating points on a grid.

  • 00:05:00 In this section, the speaker explains how data parallel execution works in OpenCL. He states that it essentially involves independent operations on a lot of data, and that this is called data parallel execution. The speaker goes on to explain that this type of execution may result in performance loss due to the variations in computations done on the data, such as those that may occur when performing operations on different colored pixels. He then discusses the use of single and double precision in OpenCL, stating that double precision requires twice as much data as single precision and is, therefore, more expensive to execute.
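The intuition above can be sketched in plain C. This is an illustrative model of my own (the SAXPY-style loop and the helper names are assumptions, not anything from the video): arithmetic intensity is math operations per byte of memory traffic, and doubling the element size doubles the bytes moved, halving the intensity for the same amount of math.

```c
#include <stddef.h>

/* Arithmetic intensity: floating-point operations per byte of memory
   traffic. High-intensity loops keep the math units busy; low-intensity
   loops stall waiting for memory. (Illustrative model, not an API.) */
double arithmetic_intensity(double flops, double bytes_moved) {
    return flops / bytes_moved;
}

/* Bytes moved by an n-element SAXPY-style loop (y[i] += a * x[i]):
   read x, read y, write y. Double precision moves twice the data of
   single precision, so the same loop has half the intensity. */
size_t bytes_moved_saxpy(size_t n, size_t elem_size) {
    return 3 * n * elem_size;   /* 2 reads + 1 write per element */
}
```

For the same 2n floating-point operations, the double-precision version of the loop moves twice the bytes, which is exactly why the speaker calls double precision "more expensive to execute".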
 

Local and Global Dimensions in OpenCL



Local and Global Dimensions in OpenCL (3)

This video delves into the concept of global and local dimensions in OpenCL and how they are used to specify parallelism in code execution. The global dimension is a 1D, 2D, or 3D array that determines the number of threads or work items to be executed for each kernel execution. For instance, if the global dimension is a 3D array with a thousand points, each point will have a thread or work item executed. Meanwhile, the local dimension divides the global dimension into local workgroups, or groups of threads that run together, which facilitates synchronization. Synchronization is only permitted within the same workgroup, so it is critical to select local dimensions that allow for the required synchronization.
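The relationship between the global and local dimensions can be shown with the 1D index arithmetic OpenCL itself uses: a work item's global ID is its workgroup ID times the local size plus its position within the group. The sketch below is a plain-C restatement of that arithmetic (the function names are mine, not OpenCL API calls):

```c
#include <stddef.h>

/* What get_global_id(0) effectively returns for a 1D kernel: the
   workgroup's index times the workgroup size, plus the work item's
   index inside that group. */
size_t global_id_1d(size_t group_id, size_t local_size, size_t local_id) {
    return group_id * local_size + local_id;
}

/* How many workgroups a global size splits into when it divides
   evenly by the local size. */
size_t num_groups_1d(size_t global_size, size_t local_size) {
    return global_size / local_size;
}
```

So a global dimension of 1024 with a local dimension of 64 runs 16 workgroups, and only the 64 work items sharing a group can synchronize with each other.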

Local and Global Dimensions in OpenCL (3)
  • 2016.04.06
  • www.youtube.com
How to specify parallelism in OpenCL kernels with global dimensions and local dimensions. How to choose the right dimensions.
 

Issues with local dimensions in OpenCL



Issues with local dimensions in OpenCL (4)

The video explores several issues related to local dimensions in OpenCL, including synchronization limitations and device utilization. Synchronization is restricted to the same workgroup on the GPU, and global synchronization is expensive and can only be used at the end of kernel execution. Choosing the right local workgroup size is crucial to avoid wasting hardware, and the speaker suggests selecting dimensions that are nice multiples of the physical hardware size. The video concludes by recommending a trial-and-error approach to find the best dimensions for optimal performance.

  • 00:00:00 In this section, the video explores two issues related to synchronization and device utilization when choosing local dimensions in OpenCL. The local workgroup size is limited to 512 threads (up to 1024 on some devices, depending on code complexity), and synchronization can only occur within the same workgroup. The video uses a reduction application to demonstrate how synchronization works and the limitations imposed by workgroup sizes. The video attributes the limited synchronization capability to GPU scalability needs and the cost of supporting arbitrary synchronization elsewhere on the chip.

  • 00:05:00 In this section, the video explores issues with local dimensions in OpenCL. The first example shows how using spin locks can result in a deadlock due to the scheduler's lack of guarantees of forward progress. The video also explains that global synchronization can only be done at the end of kernel execution, making it expensive and forcing programmers to carefully plan their algorithms. Another issue is device utilization when local workgroup sizes are not matched to the size of compute units. This results in wasting parts of the hardware, and to avoid this problem, programmers need to choose dimensions that work well for the problem and match nicely to the hardware size.

  • 00:10:00 In this section, the speaker discusses the factors that influence the choice of local dimensions in OpenCL. They explain that on a GPU, it's best to have over 2,000 work items, in nice multiples of the physical hardware size, such as 32 for Nvidia or 64 for AMD. For CPUs, it's best to have twice the number of CPU cores, but this can vary depending on the algorithms being used. The speaker suggests trial and error until the best performance is achieved.
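A practical consequence of matching workgroup sizes to hardware is that hosts typically pad the global size up to a multiple of the chosen local size (OpenCL 1.x requires the global size to be divisible by the local size) and have the kernel bounds-check against the real problem size. A minimal sketch of that common rounding helper, assuming nothing beyond integer arithmetic:

```c
#include <stddef.h>

/* Round the global work size up to the next multiple of the local
   size, so no compute unit is left partially filled and the
   divisibility requirement is met. The kernel must then skip work
   items whose global ID is >= the real problem size. */
size_t round_up_global(size_t global, size_t local) {
    return (global + local - 1) / local * local;
}
```

For example, 1000 work items with a local size of 64 get padded to 1024, and the 24 extra work items simply return early inside the kernel.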
Issues with local dimensions in OpenCL (4)
  • 2016.04.06
  • www.youtube.com
Handling reductions with local dimensions and problems with spin locks and device utilization on GPUs.
 

OpenCL Compute Kernels



OpenCL Compute Kernels (5)

The instructor explains that OpenCL kernels are C99 code used for parallel computing. The kernels are executed thousands of times in parallel and form the inner loop of the computation. OpenCL features such as vectors, precise control over rounding and conversions, and intrinsic functions guarantee accuracy. OpenCL's utility functions provide information about work items, such as their IDs, dimensions, and group IDs, allowing the creation of flexible kernels that can adjust. However, relying on OpenCL library functions involves a tradeoff between performance and precision, because operation reordering in parallel code can affect the sequence of execution and change the results, making deterministic execution across all devices impossible.

  • 00:00:00 In this section, the instructor explains that OpenCL kernels are basically just C99 code and are used to specify computations that will be done in parallel. The code is executed thousands of times in parallel and is the inner loop of the computation. The instructor then gives an example of a C function and how it can be executed in parallel using OpenCL kernels. He also talks about some of the features of OpenCL, such as vectors, explicit ability to control rounding and conversions, and intrinsic functions that come with guaranteed accuracy. The utility functions of OpenCL also give information about each work item, such as work item ID, dimensions, maximum number in a particular dimension, and group ids, which helps in writing flexible kernels that can be clever about figuring out which work they should do. Overall, OpenCL enhances the ability to make portable and performant code by providing guaranteed availability and precision.

  • 00:05:00 In this section, the speaker explains the tradeoff between performance and precision when using OpenCL-compliant library functions. Although these functions guarantee precision when tested, that does not necessarily mean applications will generate the same results on all OpenCL machines. The reason is that the compiler may reorder operations in parallel code, affecting the sequence of execution and possibly changing the final results. Therefore, while building code on these library functions is preferred, deterministic execution on all devices cannot be guaranteed.
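The point that a kernel is "the inner loop of the computation" can be made concrete. Below, a typical OpenCL kernel (shown in a comment; the kernel name `square` is my own example) is paired with its serial C99 equivalent: the kernel body is the loop body, and the runtime supplies the loop index via get_global_id.

```c
#include <stddef.h>

/* An OpenCL kernel replaces the inner loop of a serial computation:
 *
 *   __kernel void square(__global const float *in,
 *                        __global float *out) {
 *       size_t i = get_global_id(0);
 *       out[i] = in[i] * in[i];
 *   }
 *
 * The serial C99 equivalent runs the same body once per index: */
void square_serial(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * in[i];   /* one "work item" per iteration */
}
```

On a GPU, all n "iterations" are launched as independent work items instead of running one after another, which is exactly the data-parallel pattern the earlier sections describe.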
OpenCL Compute Kernels (5)
  • 2016.04.06
  • www.youtube.com
How to write compute kernels in OpenCL for parallelism, OpenCL utility functions and intrinsics.
 

OpenCL Runtime Architecture



OpenCL Runtime Architecture (6)

The video discusses the architecture of the OpenCL platform, including its devices like GPUs and CPUs connected through a memory bus. OpenCL contexts are also explained as groupings of devices within a platform, allowing for optimized data transfer between them. Command queues are introduced as a means to submit work to different devices, but the distribution of work among devices needs to be done manually as there is no automatic distribution.
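Since work is not distributed among devices automatically, the host must partition the problem itself before enqueueing ranges on each device's command queue. A minimal sketch of one way to do that even split (the helper is hypothetical, not an OpenCL API):

```c
#include <stddef.h>

/* Evenly split n work items across k devices. Device `device` gets
   the range [*offset, *offset + *count); any remainder is spread one
   extra item at a time over the first devices. */
void split_work(size_t n, size_t k, size_t device,
                size_t *offset, size_t *count) {
    size_t base = n / k;
    size_t rem  = n % k;
    *offset = device * base + (device < rem ? device : rem);
    *count  = base + (device < rem ? 1 : 0);
}
```

The host would then enqueue a kernel on each device's queue with its own offset and count, something OpenCL will not do on your behalf.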

OpenCL Runtime Architecture (6)
  • 2016.04.06
  • www.youtube.com
OpenCL architecture: devices, queues, contexts, compute units, data transfer and utilizing multiple devices.
 

Data Movement in OpenCL



Data Movement in OpenCL (7)

The video discusses data movement in OpenCL, where the speaker explains the manual steps required to copy data between the host memory and GPU, and the difference in speed between global and local memory. The global memory in the GPU has faster access, but getting data from the host memory to the GPU is slow. Local memory in OpenCL can provide improved performance with massive bandwidth but is harder to use than caches since it requires manual allocation. Modern Nvidia GPUs offer the choice between manually managing local memory or using it as a cache instead, with the recommended approach to start with a cache before optimizing for local data movement.

  • 00:00:00 In this section, the speaker discusses how data movement works in OpenCL and the manual steps required for copying data from the host memory to the GPU and back. The GPU has global memory that has much faster access than the host memory, but getting data from the host memory to the GPU is slow due to the PCIe bus. The GPU also has local memory that has massive bandwidth, and using it can significantly improve performance. However, allocating and copying data to the local memory needs to be done manually in each compute unit, making it a cumbersome task.

  • 00:05:00 In this section, the speaker talks about local memory in OpenCL, which can range from 16 to 48 kilobytes, and how it can provide higher bandwidth of thousands of gigabytes per second. However, local memory is harder to use than caches because caches automatically place the most recently used data without needing to allocate different parts of the memory for different data, while local memory requires manual allocation. Modern Nvidia GPUs allow the choice between managing local memory manually or using it as a cache, with the recommended approach to start with a cache before optimizing for local data movement.
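The manual-allocation pattern the speaker describes for __local memory can be simulated in plain C: each workgroup stages a tile of global data into a small fast buffer, computes on it, then moves on. The sketch below is my own illustration of the staging pattern, not code from the video or the OpenCL API:

```c
#include <stddef.h>

#define TILE 4   /* stands in for a small __local buffer */

/* Sum an array by staging TILE-sized chunks into a small buffer
   first, the way a kernel copies global memory into __local memory
   before computing on it. */
float sum_with_tiles(const float *global_buf, size_t n) {
    float local_buf[TILE];
    float total = 0.0f;
    for (size_t base = 0; base < n; base += TILE) {
        size_t len = (n - base < TILE) ? n - base : TILE;
        for (size_t i = 0; i < len; i++)     /* "copy in" phase */
            local_buf[i] = global_buf[base + i];
        for (size_t i = 0; i < len; i++)     /* compute on fast memory */
            total += local_buf[i];
    }
    return total;
}
```

In a real kernel the copy-in phase is done cooperatively by the work items of a group, followed by a barrier, which is precisely the bookkeeping that makes local memory harder to use than a hardware cache.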
Data Movement in OpenCL (7)
  • 2016.04.06
  • www.youtube.com
Host to device transfer speeds, local memory.
 

OpenCL Hello World



OpenCL Hello World (8)

In this video, the process of creating a program using OpenCL and submitting it to a GPU device is explained. The speaker walks through the steps of building the program, creating kernels and memory objects, and copying data between CPU and GPU. They also explain the process of setting the kernel arguments and dimensions, executing the kernel, and retrieving the results from the GPU. The speaker notes that complicated kernels may not give optimal performance on both the CPU and GPU and might need to be fixed to improve performance. They compare the process of programming in OpenCL to solving a math problem, where operations are repeated until the desired result is achieved.

  • 00:00:00 In this section, the speaker explains the steps needed to set up OpenCL and create a program using it. Firstly, devices and platforms need to be set up and a context must be created for executing commands. Then, command queues are created to submit work to different devices. The code is then compiled to get kernel objects that can be submitted to the queues. Memory objects are created to exchange data between devices and arguments are set up for the kernel. The kernel is then queued for execution and data is copied back from the device to the CPU. Finally, all commands need to be completed and a wait is implemented to ensure that the data is returned as expected. The speaker also walks through an example OpenCL Hello World program that calculates sine of x in parallel using devices.

  • 00:05:00 In this section of the video, the speaker explains the process of creating a program using OpenCL and submitting it to a GPU device. They start by building the program, which takes longer the first time but not on subsequent runs. They then create a kernel object for a particular kernel in the program by calling clCreateKernel. After that, they create a memory object, allocate space on the device, and copy data from the CPU to the GPU using clEnqueueWriteBuffer. The speaker then sets the kernel arguments and dimensions and executes the kernel using clEnqueueNDRangeKernel. Finally, the speaker retrieves the results from the GPU and waits for everything to finish by calling clFinish. The speaker concludes by stating that complicated kernels may not give optimal performance on both the CPU and GPU and might need adjustments to improve performance.

  • 00:10:00 In this section, the speaker explains that programming often involves repeating certain commands until achieving the desired final output. He compares it to solving a math problem, where one would do a set of operations repeatedly until reaching the correct answer. He notes that this process is similar when using OpenCL, where programming commands are repeated multiple times until the desired result is achieved.
OpenCL Hello World (8)
  • 2016.04.06
  • www.youtube.com
Writing a simple Hello World parallel program in OpenCL for GPUs: device setup, kernel compilation, copying data.
 

More OpenCL Features



More OpenCL Features (9)

The video discusses additional features of OpenCL such as device querying, image handling, and events. Users can call clGetDeviceInfo to find out details about their devices, although these values may not always be entirely precise. OpenCL's native support for 2D and 3D image types can be slow without hardware support on CPUs, but is hardware-accelerated on GPUs. Events are essential when working with asynchronous command execution and multiple devices, since command queues for different devices run asynchronously with respect to each other and require synchronization between them. The speaker provides an example of using events to ensure that kernel B waits for kernel A to finish before running, by enqueueing the kernels with their respective events, copying the output, and waiting on the events to provide synchronization.

  • 00:00:00 In this section, the speaker discusses additional features of OpenCL, including querying devices, handling images, and OpenCL events. By querying devices with clGetDeviceInfo, users can find out information such as the number of compute units, clock frequency, global memory size, and more. However, the speaker cautions that these values may not be as precise as desired. OpenCL supports 2D and 3D image types natively, which can be linearly interpolated, wrapped around edges, or clamped at edges. While these features are hardware-accelerated on GPUs, they are slow on CPUs without hardware support. Finally, events are important when working with asynchronous command execution and multiple devices, as command queues for different devices are asynchronous with respect to each other, requiring synchronization between them.

  • 00:05:00 In this section, the speaker explains events and their usage in OpenCL. Every enqueue command takes three event-related arguments at the end: the number of events in the wait list, the wait list itself, and a returned event. These let users track whether a kernel is done, make other commands wait for a kernel to finish, and even obtain profiling information. The speaker provides an example of using events to make sure kernel B on the GPU waits for kernel A on the CPU to finish and copy its output to the GPU before running. It involves enqueueing the first kernel with an event, enqueueing a copy that waits on that event, and having the second kernel wait on the copy to ensure synchronization.
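The dependency chain in that example (kernel A, then the copy, then kernel B) can be modeled as tasks that only run once their prerequisite has "fired". The toy scheduler below is purely my own sketch of the ordering semantics, not the OpenCL event API:

```c
#include <stddef.h>

/* A task waits on at most one prerequisite (index into the task
   array, or -1 for none), mimicking an enqueue with a one-entry
   event wait list. */
typedef struct { int prereq; int done; } Task;

/* Execute tasks respecting prerequisites, recording completion order
   in `order`. Returns the number executed, or -1 if waits form a
   cycle (the deadlock the spin-lock section warns about). */
int run_in_order(Task *tasks, int n, int *order) {
    int executed = 0;
    while (executed < n) {
        int progress = 0;
        for (int i = 0; i < n; i++) {
            if (!tasks[i].done &&
                (tasks[i].prereq < 0 || tasks[tasks[i].prereq].done)) {
                tasks[i].done = 1;          /* the "event" fires */
                order[executed++] = i;
                progress = 1;
            }
        }
        if (!progress) return -1;
    }
    return executed;
}
```

Chaining kernel A, the copy, and kernel B as tasks 0, 1, and 2 with prerequisites -1, 0, and 1 reproduces the guaranteed ordering the events provide across otherwise-asynchronous queues.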
More OpenCL Features (9)
  • 2016.04.06
  • www.youtube.com
System information, Image types, events.
 

OpenCL Performance Tips and Summary



OpenCL Performance Tips and Summary (10)

The video discusses tips for optimizing OpenCL performance, which includes minimizing data transfers, optimizing memory access, using producer-consumer kernels, and utilizing vectors and fast math functions. The speaker highlights that applications suitable for GPUs should be data parallel, computationally intensive, avoid global synchronization, comfortable with single precision, and manageable with small caches. If experiencing poor performance with OpenCL, it may be necessary to reconsider the algorithm and optimize memory locality, shared or local memory, and avoid unnecessary synchronization between work items.

  • 00:00:00 In this section, the speaker discusses tips for optimizing OpenCL performance, including minimizing data transfers between CPU and GPU by keeping data on the device as long as possible, and using producer-consumer kernel chains. The speaker also emphasizes the importance of optimizing memory access by optimizing for memory coalescing and managing local memory on GPUs. Additionally, the speaker notes that the use of vectors can improve performance on certain hardware, and using fast or native variants of certain math functions can result in a significant speed boost. Lastly, the speaker discusses the characteristics of applications that are a good match for GPUs, including being data parallel, computationally intensive, not requiring global synchronization, comfortable with single precision, and manageable with small caches.

  • 00:05:00 In this section, the speaker suggests that if you are experiencing poor performance with OpenCL, you may need to reconsider your algorithm and choose one that fits better with the parallel processing pattern. This may involve changing the order or structure of your code to optimize memory locality, utilizing shared memory or local memory, and avoiding unnecessary synchronization between work items.
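The advice to minimize CPU-GPU transfers follows from a simple cost model: a kernel is only worth offloading if its compute time dominates the time to move its data over the bus. The numbers below (a 16 GB/s bus, a 1 TFLOP/s device) are illustrative assumptions of mine, not measurements:

```c
/* Back-of-envelope model behind "keep data on the device as long as
   possible": compare bus transfer time with device compute time.
   (Illustrative cost model; all rates are assumed, not measured.) */
double transfer_seconds(double bytes, double bus_bytes_per_sec) {
    return bytes / bus_bytes_per_sec;
}

double compute_seconds(double flops, double device_flops_per_sec) {
    return flops / device_flops_per_sec;
}
```

With these assumed rates, a low-intensity kernel doing 2 operations per byte spends far longer on the PCIe transfer than on the math, which is why chaining producer-consumer kernels on the device beats round-tripping intermediate results through the host.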
OpenCL Performance Tips and Summary (10)
  • 2016.04.06
  • www.youtube.com
OpenCL kernel and runtime performance optimizations, checklist for using OpenCL.