OpenCL in trading - page 3

 

AMD Developer Central: OpenCL Programming Webinar Series. 2. Introduction to OpenCL



2 - Introduction to OpenCL

This video provides a detailed introduction to OpenCL, a platform for parallel computing that can use CPUs and GPUs to accelerate computations. Programs written in OpenCL can be executed on different devices and architectures, allowing for portable code across different platforms. The video discusses the different execution models in OpenCL, including data parallelism and task parallelism, and also covers the different objects and commands used in OpenCL, such as memory objects, command queues, and kernel objects. The video also delves into the advantages and limitations of using OpenCL, such as the need for explicit memory management and the potential for significant performance improvements in parallel programs.

  • 00:00:00 In this section, the speaker introduces OpenCL and its ability to use CPUs and GPUs to accelerate parallel computations, resulting in significant speed-ups. Although numbers like 100X or 1000X are sometimes quoted, realistically, speed-ups of around 10-20X are expected for optimized programs. OpenCL lets developers write portable code across different devices and architectures; therefore, programs written for AMD GPUs can typically run on NVIDIA GPUs as well. AMD's implementation provides OpenCL on both the CPU and the GPU, unlike some competitors' implementations. The section ends with an overview of heterogeneous computing and how OpenCL fits into it.

  • 00:05:00 In this section, the speaker provides an introduction to OpenCL, which is a low-level and verbose platform that can be rewarding in terms of performance if the application has the right characteristics. OpenCL is a platform-based model that consists of a host API, a model of connected devices, and a memory model. The devices are seen as a collection of compute units, and each compute unit is broken down into processing elements that execute in SIMD. The execution model is based on the notion of a kernel, which is the unit of executable code that can be executed in parallel on multiple data. Additionally, OpenCL provides a set of queues that allow asynchronous execution of read and write operations and kernel execution, which can be in order or out of order.

  • 00:10:00 In this section, the speaker discusses the two main execution models in OpenCL: data parallelism and task parallelism. The data parallel model is the most efficient for execution on GPUs and involves an N-dimensional computation domain where each individual element is called a work item that can potentially execute in parallel. Work items can be grouped into work groups, which are executed on one of the available compute units. The speaker also explains how the loop is implicit in the data parallel world of OpenCL and how the get_global_id routine is used to index into the arrays a and b, as in the sketch below.
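
As an illustration of this data-parallel style, here is a minimal vector-add kernel in the spirit of the example described; the array names a, b and c are just the ones mentioned above, and the host is assumed to enqueue one work item per element.

    // Each work item computes one element; the loop over i is implicit in
    // the N-dimensional range the host enqueues.
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        int i = get_global_id(0);   // unique index of this work item
        c[i] = a[i] + b[i];
    }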

  • 00:15:00 In this section, the speaker discusses the local memory, which is similar to a cache and is managed by the user, being low-latency and faster than global memory. The synchronization function allows work items to write to a memory location, wait for completion, and then have another work item read that memory location and get the value written by the previous work item. Synchronization can only be done within the workgroup. OpenCL also supports task parallelism, executed by a single work item, and benefits from using the OpenCL queuing model and synchronization features of the host API, allowing for natively compiled code. OpenCL differs from C in that the memory hierarchy of particular GPUs is explicitly exposed, and the programmer can induce an ordering and consistency of the data through the use of barrier operations and synchronization.
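
A small sketch of the local-memory and barrier usage described here; the kernel name and the staging pattern are illustrative, not taken from the webinar.

    __kernel void stage(__global const float *in,
                        __global float *out,
                        __local  float *tile)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);

        tile[lid] = in[gid];              // each work item writes one slot
        barrier(CLK_LOCAL_MEM_FENCE);     // wait until the whole work group has written

        // Any work item may now read values written by the others in the
        // same work group; synchronization across work groups is not possible.
        int next = (lid + 1) % (int)get_local_size(0);
        out[gid] = tile[next];
    }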

  • 00:20:00 In this section, the speaker discusses the advantages of OpenCL, such as its powerful performance capabilities. However, they point out that the memory management in OpenCL requires explicit operations, which could be complicated and requires careful consideration of the algorithm. The compilation model is based on OpenGL, and it has an online compilation model that allows for passing in streams or strings of OpenCL source code for compiling online. OpenCL is built around the context, a collection of devices, and memory objects, and queues are used to submit work to a particular device associated with the context. Memory objects are buffers, which are one-dimensional blocks of memory that can be thought of as arrays.

  • 00:25:00 In this section, the speaker explains the different OpenCL objects, including memory objects, images, programs, and kernels. Memory objects can be buffers or image types, where images have a hardware implementation to optimize access. Programs that define kernels for execution may be built and extracted, and kernel argument values may be set using kernel objects. Command queues are used to enqueue kernels and other commands for execution. Additionally, OpenCL events are useful for building dependencies between commands and querying the status of commands. The speaker also gives examples of how to query for devices and their IDs.

  • 00:30:00 In this section, the speaker explains how OpenCL returns error codes and objects. Functions that create a CL object return that object as their result, while functions that do not return a CL object return the error code as their result. The context is discussed, and how memory is allocated at the context level, meaning that buffers and images are shared across devices. The speaker also mentions the clGet...Info family of functions, specifically clGetDeviceInfo, which allows developers to query device capabilities to determine which device is best suited for the algorithm. Finally, the speaker discusses buffers and images, how kernels can access them, and the restrictions on image access.
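
A minimal, self-contained sketch of the query pattern and the two error-reporting conventions mentioned here; the choice of CL_DEVICE_MAX_COMPUTE_UNITS is just an example.

    #include <CL/cl.h>
    #include <stdio.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id   device;
        cl_int         err;

        /* Functions that do not create a CL object return the error code. */
        err = clGetPlatformIDs(1, &platform, NULL);
        err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* Functions that create a CL object return the object and pass the
           error code back through the last parameter. */
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

        /* clGetDeviceInfo lets the host decide whether a device suits the
           algorithm before committing to it. */
        cl_uint units;
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(units), &units, NULL);
        printf("compute units: %u\n", units);

        clReleaseContext(ctx);
        return 0;
    }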

  • 00:35:00 In this section, the speaker discusses how to allocate buffers and images in OpenCL, including how to describe how the buffers will be accessed and the benefits of using images. The speaker also explains how to access memory object data using explicit commands, and how to enqueue those commands to a queue. Additionally, the video explains how to map a region and transfer data between buffers, as well as the advantages and disadvantages of using DMA units. The section concludes by discussing program and kernel objects and setting argument values.
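
A sketch of the explicit buffer management being described, assuming a context ctx, a command queue queue, a host array host_data and an element count n have already been set up.

    cl_int err;
    size_t bytes = n * sizeof(float);

    /* A buffer is a one-dimensional block of memory, seen by kernels as an array. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

    /* Explicit, blocking copy from host memory into the buffer. */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, host_data,
                         0, NULL, NULL);

    /* Alternatively, map a region and touch it through a host pointer; on
       some devices the runtime can use the DMA engine for the transfer. */
    float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                           0, bytes, 0, NULL, NULL, &err);
    /* ... read p[0 .. n-1] here ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);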

  • 00:40:00 In this section, the speaker discusses dispatch and dependencies in OpenCL. They explain how the dispatch will execute based on the domain or grid of execution, and how dependencies can be set up to ensure that commands do not overtake each other. The speaker also explains the arguments of the enqueue command, which include the number of events in the wait list and the event associated with the command. Finally, the speaker gives an overview of the OpenCL C language, which is based on C and has certain restrictions and additions, such as vector types and synchronization primitives. The language provides work items and work groups, as well as address space qualifiers and built-in functions.
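
A sketch of the dispatch just described, showing where the event wait list and the returned event sit in the enqueue call; kernel, buf, queue, host_data and n are assumed to exist already, with n a multiple of the work-group size.

    cl_event write_done, kernel_done;
    size_t global = n;     /* domain (grid) of execution */
    size_t local  = 64;    /* work-group size            */

    /* Non-blocking write that signals write_done when it completes. */
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, n * sizeof(float),
                         host_data, 0, NULL, &write_done);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    /* The last three arguments are: number of events in the wait list,
       the wait list itself, and the event associated with this command. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                           1, &write_done, &kernel_done);

    clWaitForEvents(1, &kernel_done);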

  • 00:45:00 In this section, the speaker gives a brief overview of OpenCL and its features, such as the different address spaces, vector types, and scalar types that can be used in programming kernels. They also discuss how to create memory objects and build and execute programs using the OpenCL API. The speaker then addresses a question about how data parallelism in OpenCL differs from loop unrolling in compilers.

  • 00:50:00 In this section, the speaker explains the concept of data parallel execution and the difficulty of extracting it automatically and efficiently. He also emphasizes the need to explicitly parallelize programs for OpenCL or any other model. The enqueue-marker command is another topic that is covered, along with how it is useful in out-of-order queues. The speaker reiterates that constant memory means the values are constant, and that on some GPUs there is special hardware to load them into very fast, read-only constant memory. He suggests checking out the OpenCL Zone on the AMD website for more information on OpenCL and parallel programming. Lastly, he addresses a question about whether get_global_id(0) returns the same value for every invocation of the kernel.

  • 00:55:00 In this section, the speaker explains that when two different applications are both running and trying to use OpenCL on the same machine, all implementations today will share the hardware and the OS will multiplex the applications. They recommend using the profilers available for OpenCL, such as the Visual Studio plugin or the Linux command line version, which allow one to query information about the hardware. The overhead of loading data into an image object or buffer depends on the device; transferring it across the PCIe bus has greater latency. Finally, the speaker mentions that best programming practices for the new AMD Radeon HD 6800 series GPUs are similar to those for the Evergreen architecture they are based on.
 

AMD Developer Central: OpenCL Programming Webinar Series. 3. GPU Architecture



3 - GPU Architecture

This video provides an overview of GPU architecture, taking note of the origins and primary use of GPUs as graphics processors. GPUs are designed for processing pixels with a high degree of parallelism, in contrast to CPUs designed for scalar processing with low-latency pipelines. The architecture of GPUs is optimized for graphics-specific tasks, which may not be suitable for general-purpose computation. The speaker explains how the GPU maximizes the throughput of a set of threads rather than minimizing the execution latency of a single thread. The architecture of the GPU engine block is also discussed, including local data shares, wavefronts, and work groups. The video explores various GPU architecture features that help increase the amount of packing the compiler can do, including issuing dependent operations in a single packet, and the global data share. Although there are similarities between GPU and CPU core designs, their workloads would need to converge for the designs themselves to converge.

In this video on GPU architecture, the speaker delves into the concept of barriers and their function. When a work group contains multiple wavefronts on a GPU, barriers are used to synchronize those wavefronts. However, if only one wavefront exists in a work group, barriers are meaningless and are reduced to no-ops.

  • 00:00:00 In this section, the presenters introduce the purpose of the webinar, which is to provide an overview of GPU architecture from a different perspective compared to the traditional approach of describing the low-level architecture or optimizations. The presenters aim to put the GPU architecture into context by discussing its origins and primary use case as graphics processors. They explain how GPUs are designed for processing pixels with a high degree of parallelism, which is different from CPUs designed for scalar processing with low latency pipelines. The presenters also touch on how the architecture of GPUs may not be suitable for general-purpose computation due to their optimized hardware blocks designed for accelerating graphics-specific tasks.

  • 00:05:00 In this section, we learn about executing fragment programs that are independent and relate to a single pixel, written in GLSL or HLSL, and how this programming pattern allows for efficient parallelism with no dependence analysis or inter-pixel communication. The hardware is designed to execute shader code on multiple pixels at once, known as a wavefront, but blocks of code containing branches pose a problem. There is no issue as long as all branches go the same way; when they diverge, the hardware generates a mask so that all pixels still execute the same instruction while only the active lanes produce results.

  • 00:10:00 In this section, the speaker discusses how SIMD execution issues a single instruction across many lanes and how this can be exposed through vector instructions. While vector instructions can be generated by hand or by a compiler, explicit vector coding can make development fiddly and difficult due to the need for manual packing of different operations, branch masking, and careful hand-coding. On the other hand, presenting SIMD execution as independently branching lanes risks making the developer think that the lanes really do branch independently, which isn't true. Despite this, it is easier for programmers to think about, and on current AMD GPUs masking is controlled by hardware. This matters for compute, which is the purpose of the talk, as it can affect performance, especially with branch divergence across the wide wavefront.

  • 00:15:00 In this section, the speaker discusses the visible aspects of GPU architectures that are based on throughput computing. If a vector instruction stalls, which can happen when a floating-point add takes several cycles to complete, an instruction from another wavefront can be issued to cover the stall, making instruction decode more efficient. The speaker explains that instead of increasing the vector width, which can reduce ALU utilization, multiple wavefronts are kept resident so instructions from other running threads can be slotted in, reducing the probability of stalling while waiting for the texture units to return memory data. However, a single wavefront takes longer to execute this way because of how the architecture works.

  • 00:20:00 In this section, the video explains how the GPU maximizes the throughput of a set of threads rather than minimizing the execution latency of a single thread. Rather than waiting for the first wavefront to complete execution before issuing the next, the GPU feeds further wavefronts into the pipeline behind it to keep the pipeline as close to full occupancy as feasible. The GPU maintains a large pool of registers to cover the full width of the vector for each thread in flight, taking up space on the device that scales with the number of wavefronts in flight and the width of the vector. The GPU is designed to cover latency, and therefore instead of minimizing latency, the GPU maximizes the memory bandwidth available so that it can feed all of that parallelism, using texture caches and local memory for the reuse of data between work items.

  • 00:25:00 In this section, the speaker discusses how caches and program-controlled shared memory regions can reduce data transfer by allowing data to be copied just once from the main memory interface and then reused by different work items along varying paths. They explain how the texture cache is designed to exploit 2D structure in its data to capture the 2D accesses made by the quads. The local data share provides considerably more control, but it is the developer's responsibility to structure loads efficiently to use this memory and to share data so as to reduce global memory requirements. The section also explores how the GPU can be viewed as a collection of wide SIMD cores with multiple program states, interleaving 4, 8, or 16 threads to cover pipeline latency. The trade-off between the number of cores and ALU density is discussed, with the benefit of increased utilization depending on the workload.

  • 00:30:00 In this section, the speaker discusses the similarities and differences between CPU and GPU core designs, using examples such as the AMD Phenom II X6 and the Intel i7. While the approach taken by the Pentium 4 and the UltraSPARC T2 involved larger cores with multiple thread states to improve utilization, GPUs are at the extreme end of the spectrum with a very high degree of data parallelism. The technical details of the AMD Radeon HD 5870 are also discussed, pointing out its high bandwidth and the number of concurrent wavefronts available, which depends on the number of registers used by each wavefront. The speaker concludes that while there may be similarities in the design space between CPU and GPU designs, their workloads would need to converge for the designs themselves to converge.

  • 00:35:00 In this section, we learn about the GPU's design elements, including the local data share and the wavefronts and work groups that work is divided into. A command processor dispatches kernels and generates wavefronts, which are assigned to an available SIMD unit that has enough free resources. The entire chip has 20 SIMD engines and 8 GDDR5 memory channels connected through crossbars and caches. In addition, it features a relaxed global memory consistency model, which requires fence instructions to ensure the visibility of writes, allowing the SIMD engines and fixed-function units to maintain as much data-parallel execution as possible without the power and performance overhead of stricter ordering. The GPU uses a clause-based execution model, enabling many programs with control flow to execute simultaneously on a SIMD engine.

  • 00:40:00 In this section, the architecture of the GPU engine block is discussed. The engine block has two major components: the local data share, which allows sharing of data between work items in a work group, and the processing elements, or stream cores, which execute instructions from ALU clauses in the kernel. The local data share has 32 banks; each of the 16 processing elements in the engine can request a read or write of 32-bit words from the LDS on each cycle at arbitrary addresses, and conflicts are detected by the unit. Atomic operations are performed using integer-only ALUs, and floating-point atomics would be more complicated to perform. The processing element of the 5870 architecture is a cluster of five ALUs that operates on a very long instruction word (VLIW) packet of five operations packaged by the compiler, subject to the dependencies it can resolve, and most basic operations can be executed on any of the lanes.

  • 00:45:00 In this section, the speaker describes various GPU architecture features that help increase the amount of packing the compiler can do, including issuing dependent operations in a single packet and supporting counters in the global data share. The global data share is a lesser-known feature that is connected to the entire set of compute engines on the device and has far lower latency than accessing other memories. The speaker also warns that accessing random pixels through the texture cache may cause performance issues: it will only be faster than plain global memory accesses if the data is clustered together.

  • 00:50:00 In this section, the speaker answers questions related to achieving full occupancy in GPU architecture. The example given is that work groups should consist of multiples of 64 work items to get full occupancy. Wavefronts and the number of wavefronts that can fit on each core also impact full occupancy. The speaker also mentions that the transcendental functions generated by the fifth ALU lane are fast approximations rather than full-precision versions, which may or may not be good enough depending on the needs of the code. When asked whether there is a way to query the wavefront size across all devices, the answer is that there is no portable way to do so.

  • 00:55:00 In this section, the speaker explains what fully coalesced means in terms of global memory accesses in a GPU architecture. Essentially, it means that a quarter wavefront issues a memory request in which each lane accesses 128 bits from consecutive, aligned addresses going across the compute unit. However, there are levels of efficiency depending on the access pattern, such as whether the accesses are aligned or a random gather. The speaker also clarifies that a wavefront is a unit of work consisting of 64 work items executed together with a single instruction, and it is not the same as a work group, which is a set of work items that may comprise one or more wavefronts.

  • 01:00:00 In this section, the speaker explains the concept of barriers in GPU architecture. If a work group has multiple wavefronts, issuing a barrier instruction will synchronize these wavefronts. However, if there is only one wavefront in a work group, barriers are reduced to no-ops and hold no meaning.
 

AMD Developer Central: OpenCL Programming Webinar Series. 4. OpenCL Programming in Detail



4 - OpenCL Programming in Detail

In this video, the speaker provides an overview of OpenCL programming, discussing its language, platform, and runtime APIs. They elaborate on the programming model, which requires fine-grained parallelization, work items and work groups (threads), synchronization, and memory management. The speaker then discusses the n-body algorithm and its computationally order n-squared nature. They explain how the OpenCL kernel code updates the position and velocity of particles under Newtonian mechanics, introduce a local-memory cache of particle positions, and show how the kernel updates the particle position and velocity using float vector data types. The speaker also delves into how the host code interacts with OpenCL kernels by setting the parameters and arguments explicitly, transferring data between the host and GPU, and enqueuing kernel execution with synchronization. Finally, the video explores how to modify the OpenCL code to support multiple devices, synchronize data between the GPUs, and set device IDs for the half-sized arrays representing them.

The second part discusses various aspects of OpenCL programming. It covers topics such as the double-buffer scheme for synchronizing the updated particle position between two arrays, OpenCL limitations and the difference between global and local pointers in memory allocation. Additionally, it highlights optimization techniques for OpenCL programming, including vector operations, controlled memory access, and loop unrolling, along with tools available for analyzing OpenCL implementation, such as profiling tools. The presenter recommends the OpenCL standard as a resource for OpenCL programmers and provides URLs for the standard and the ATI Stream SDK. The video also addresses questions on topics such as memory-sharing, code optimization, memory allocation, and computation unit utilization.

  • 00:00:00 In this section, the speaker introduces OpenCL concepts, discussing the emergence of hybrid CPU-GPU architectures and the challenges they pose for programming. OpenCL provides a platform and device-independent API with industry-wide support, and the speaker outlines the three parts of the OpenCL API: the language specification, platform API, and runtime API. The execution model is broken up into two parts: the kernel, which represents the executable code that will run on the OpenCL device, and the host program, which performs memory management and manages kernel execution on one or more devices using the command queue. The C language extension used for kernel programming is based on ISO C99 with some restrictions and additions to support parallelism.

  • 00:05:00 In this section, the speaker explains the programming model of OpenCL, which requires fine-grained parallelization across threads and synchronization. Work items and work groups are introduced as threads, and are grouped into work groups which have special properties in terms of synchronization and access to shared memory. The speaker also covers the execution model on the host side, explaining that everything is gathered together in a context which includes a collection of devices, program objects, kernels, memory objects, and the command queue for queuing up kernels and memory or data transfer operations. OpenCL also supports a hierarchy of different types of memory to reflect the nature of memory in a hybrid system with distributed memory.

  • 00:10:00 In this section, the speaker discusses the importance of managing memory and synchronization when using OpenCL programming. OpenCL has a relaxed memory consistency model, making it the programmer's responsibility to manage data transfers and control when data is moved from one device to another. Synchronization becomes essential when using multiple devices, and it's the programmer's responsibility to make sure events, including kernel execution and data transfers, are synchronized correctly. The speaker introduces the use of Standard CL, a simplified interface to OpenCL that provides a ready-to-use default context including all devices, without cutting the programmer off from full OpenCL functionality. Additionally, Standard CL simplifies memory management through clmalloc, which allocates memory that is shareable across OpenCL devices.

  • 00:15:00 In this section, the speaker discusses the basic n-body algorithm, which models the motion of n particles subject to some form of particle-particle interaction. The algorithm involves calculating the force on each particle by summing the contributions of the interaction with all other particles in the system. Once the force on every particle is known, the particle positions and velocities are updated over some small time step. This process is repeated for every particle, resulting in a simulation of these particles moving subject to the interaction forces. The algorithm is computationally order n-squared, which allows for good speed-ups on coprocessors despite their limited memory transfer bandwidth. The entire algorithm can be written in just a few dozen lines of C, as sketched below.
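
A compact C sketch of the O(n^2) force accumulation being described; the variable names, the softening term eps and the gravitational form of the interaction are illustrative rather than the webinar's exact code.

    /* One simulation step: accumulate pair-wise forces, then integrate. */
    for (int i = 0; i < n; i++) {
        float ax = 0.0f, ay = 0.0f, az = 0.0f;
        for (int j = 0; j < n; j++) {
            float dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
            float r2 = dx*dx + dy*dy + dz*dz + eps;   /* softening avoids r = 0 */
            float inv_r = 1.0f / sqrtf(r2);
            float s = m[j] * inv_r * inv_r * inv_r;
            ax += s * dx;  ay += s * dy;  az += s * dz;
        }
        vx[i] += ax * dt;  vy[i] += ay * dt;  vz[i] += az * dt;
    }
    /* Positions are advanced in a second pass so that every force
       calculation above saw the same, old positions. */
    for (int i = 0; i < n; i++) {
        x[i] += vx[i] * dt;  y[i] += vy[i] * dt;  z[i] += vz[i] * dt;
    }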

  • 00:20:00 In this section, the speaker explains the implementation of the n-body algorithm through a loop that accumulates the interaction between particles. The kernel code is designed to provide a reasonably standard implementation using good practices from an OpenCL perspective, but it may not be optimal for a particular architecture. The kernel will be executed for every work item within an index space, where each thread is responsible for updating the position and velocity of a single particle. The application has a simple one-dimensional index base with the number of work items equal to the number of particles in the system. The host code is also essential for initialization, memory management, and coordinating operations on OpenCL devices.

  • 00:25:00 In this section, the speaker introduces the kernel code for updating the position and velocity of particles in Newtonian mechanics. The prototype for the kernel code is similar to that of a C function, but with some qualifiers that qualify the address space and the use of a blocking scheme. The kernel code is stored in a separate file and used in just-in-time compilation when the program runs. The speaker then explains how the kernel code does size and index determination before getting into the actual calculation of physics. The global and local thread IDs are also explained in detail, and the speaker notes that an outer loop is implicit in the kernel code as the kernel will be automatically executed for every particle in the system.

  • 00:30:00 In this section, the speaker explains the implementation of the OpenCL kernel for the pairwise force calculation using a cache. The kernel caches one particle position and relies on the other work items in the work group to fill the cache as well. Once the 64 particle positions are cached, the kernel loops over the cached particle positions and implements the same force calculation as shown in the C code, with notable differences specific to OpenCL. These include using a float vector for the particle position, an OpenCL built-in function for square root, and storing the mass in the unused fourth component of the float vector for convenience. The kernel updates the particle position and velocity on a single clock using the float vector data type. Finally, the speaker explains the need for barriers for synchronization during the cache fill-up and loop operations.
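
A condensed OpenCL sketch of the caching scheme just described; the kernel and argument names are illustrative, and it assumes the work-group size divides the particle count, that the mass lives in the .w component of each position, and that the .w component of each velocity is zero.

    __kernel void nbody_step(__global const float4 *pos_old,
                             __global float4       *pos_new,
                             __global float4       *vel,
                             __local  float4       *cache,
                             float dt, int n)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        float4 p = pos_old[gid];          /* .w carries the particle mass */
        float4 a = (float4)(0.0f);

        for (int base = 0; base < n; base += lsz) {
            cache[lid] = pos_old[base + lid];   /* every work item fills one slot */
            barrier(CLK_LOCAL_MEM_FENCE);       /* cache is complete              */

            for (int j = 0; j < lsz; j++) {
                float4 d = cache[j] - p;
                float inv_r = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + 1e-6f);
                a += cache[j].w * inv_r * inv_r * inv_r * d;
            }
            barrier(CLK_LOCAL_MEM_FENCE);       /* done reading before the refill */
        }
        a.w = 0.0f;                             /* keep the mass in .w untouched  */

        float4 v = vel[gid] + a * dt;
        vel[gid]     = v;
        pos_new[gid] = p + v * dt;              /* write the new array: double buffer */
    }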

  • 00:35:00 In this section, we learn about how the new position and new velocity are written back to global memory in a new array to avoid overwriting data that is still required by other threads. This double buffering scheme is used in later stages to safely update the particle position without encountering any thread concurrency issues. Moving on to the host code implementation of the kernel, we learn about how the program sets the parameters, allocates memory, initializes positions and velocities, builds and compiles the OpenCL kernel, queries by name the necessary current n-body kernel, creates a one-dimensional kernel computational domain, and sets the kernel arguments explicitly. Overall, the code shows how OpenCL differs from basic C programs in that the host executes the kernel by proxy, and how the arguments for the kernel must be set directly.

  • 00:40:00 In this section, the speaker explains the process of transferring data from the host to the GPU using a clmsync call and synchronizing the arrays on the device (GPU) using the flags for blocking calls. They then discuss the loop over time steps, introducing values that will be used for diagnostic purposes, and delaying the setting of kernel arguments two and three for double buffering. The speaker notes that kernel execution is enqueued and that the clwait synchronization call is used to ensure that all kernel executions have completed before proceeding. Finally, they bring the data back from the GPU to the host using clEnqueueReadBuffer.
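
The webinar's host code uses Standard CL convenience calls; the same step loop expressed with the plain OpenCL API might look roughly like the sketch below, with the argument indices and names purely illustrative and the buffers, kernel, queue and sizes assumed to have been created already.

    cl_mem pos[2];                 /* pos[src] is read, pos[dst] is written   */
    int src = 0, dst = 1;

    for (int step = 0; step < nsteps; step++) {
        /* These two arguments are reset each step so the buffers swap roles. */
        clSetKernelArg(nbody_kernel, 0, sizeof(cl_mem), &pos[src]);
        clSetKernelArg(nbody_kernel, 1, sizeof(cl_mem), &pos[dst]);

        clEnqueueNDRangeKernel(queue, nbody_kernel, 1, NULL,
                               &global, &local, 0, NULL, NULL);
        clFinish(queue);           /* wait for the step before swapping       */

        int tmp = src; src = dst; dst = tmp;
    }

    /* Bring the final positions back from the device to the host. */
    clEnqueueReadBuffer(queue, pos[src], CL_TRUE, 0, nbytes, host_pos,
                        0, NULL, NULL);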

  • 00:45:00 In this section, the video discusses modifying the code to support multiple devices, specifically running the code on two GPUs. The approach involves dividing up the work for force calculation and particle position updating between the two GPUs, with one GPU responsible for one half of the particles. The kernel code sees little exchange, with just an additional argument added to the prototype called "position remote", which points to the particle positions for particles that a given GPU is not responsible for updating, but still needs in order to calculate the total force. There are notable changes on the host side involving memory management and synchronization issues that arise due to the use of two devices.

  • 00:50:00 In this section, the lecture explains the changes that need to be made to the kernel code in order to perform the particle simulation on two GPUs. The loop over the cache particle positions remains the same as before, but before line 21, the loop is now over local particle positions that the GPU owns. The code is then repeated for remote particles that the other GPU is responsible for updating. The code for updating particle position and velocity remains the same. To modify the host code for two GPUs, the initialization remains the same, but arrays for particle positions and velocities that hold half of the particles are also allocated, with A and B representing the two GPUs. A computational domain with an index space of only one-half the original size is created and the code for statically setting the argument pointer to the velocity array is removed.

  • 00:55:00 In this section, we learn how to set the device ID in order to define which GPU the data will be written to when copying over half-sized arrays to the respective GPUs. We also see that in order to exchange data between multiple GPUs, we need to synchronize the updated particle positions back to the GPUs by switching the half-sized particle arrays representing them. We also learn that for some implementations of OpenCL, a CL flush call may need to be introduced for true concurrence, but this is not part of the standard.

  • 01:00:00 In this section, the presenter goes through the code, which implements the double-buffer scheme to swap the old and new particle positions between two arrays. This scheme ensures that the updated particle positions are put back into either array A or B. Once the particles are synchronized to the host, they need to be copied back into the larger array to reuse some of the helper functions. The presenter recommends the OpenCL standard as a necessary resource for OpenCL programmers, and provides URLs for the standard and the ATI Stream SDK. The presenter also answers questions about whether OpenCL can be used for algorithms like singular value decomposition or non-negative matrix factorization, and confirms that the data in global memory on the GPU stays the same between kernel executions, which is crucial for more complex algorithms.

  • 01:05:00 In this section, the video discusses the current limitations of the Standard CL library, which at this point only supports Unix/Linux systems, with a Windows port being worked on. The video also addresses the question of sharing memory between a GPU and the host, explaining that while it is possible, there are performance penalties, making it typical to use global memory allocated on the graphics card. When allocating memory through the OpenCL API, there are ways to control how it is handled, but some of this is done automatically at the implementation level. Additionally, the video explains the difference between global and local pointers in memory allocation and how choosing the optimal number of cores depends on the algorithm being executed.

  • 01:10:00 In this section, the speaker discusses various questions related to OpenCL programming, such as whether the fglrx kernel module needs to be loaded and running, the compatibility of libstdcl with the C++ bindings, the optimization of OpenCL kernels for performance improvement, and the use of CL_MEM_READ_ONLY versus CL_MEM_READ_WRITE when allocating memory buffers. The speaker also points out that specific tools for editing OpenCL kernels and optimizing for AMD GPUs may exist, while noting that optimization techniques can be very subtle and require a lot of tuning.

  • 01:15:00 In this section, the speaker discusses optimization techniques for OpenCL programming, including organizing data to take advantage of vector operations, carefully controlling memory access for better performance, and manually unrolling loops for optimization. Additionally, the use of complex data types and the ability to transfer data between GPUs without going back to the host are implementation-specific and may vary depending on the compiler. The speaker also mentions that there are size limitations for memory buffers that will depend on available memory across OpenCL devices in the system. It may be possible to store simpler float parameters in constant memory for better performance.

  • 01:20:00 In this section, the speaker explains that there are tools available in various SDKs to analyze OpenCL implementation, including profiling tools to measure computational unit utilization or memory transfer utilization. The speaker also clarifies that managing memory fragmentation on the GPU side is implementation-specific, and that the implementation must manage it properly. When using double precision, there is no obvious answer on whether to lower the local work group size to 32 or 16, as it depends on the architecture being used. The speaker also mentions the availability of helper calls to easily obtain information on all devices within the standard GPU context.
 

AMD Developer Central: OpenCL Programming Webinar Series. 5. Real World OpenCL Applications



5 - Real World OpenCL Applications

In this video, Joachim Deguara talks about a multi-stream video processing application he worked on, with a key focus on performance optimization. The video covers various topics such as decoding video formats, using DMA to transfer memory between the CPU and GPU, double buffering, executing kernels, using event objects to synchronize and profile operations, OpenCL-OpenGL interop, processing swipes in videos, and choosing between OpenCL and OpenGL when processing algorithms. Joachim also discusses various sampling and SDKs available for OpenCL applications, but notes that there is currently no sample code available for the specific application discussed in the video.

  • 00:00:00 In this section, Joachim Deguara explains a multi-stream video processing application that he has worked on. The application involves opening multiple video streams, decoding them, applying video effects to them, combining them, and finally presenting one video stream which gets repeatedly processed at the end. Joachim Deguara mentions that performance is a key focus for this application, and to achieve real-time presentation, it is essential to fit decoding, processing, and display into a looping structure that occurs 30 times a second or the frame rate of the input video.

  • 00:05:00 In this section, the focus is on decoding video formats and moving the frame to the GPU. To decode the video format, it is necessary to ensure that this is done as fast as possible so that it does not affect performance. One way to do this is to have the decode function running separately from the main loop and when called, for it to return the latest frame. This ensures that decoding is done in the background while the main loop continues uninterrupted. To move the frame to the GPU, API calls are used which include specifying the image to write to and whether the write should occur synchronously or asynchronously.

  • 00:10:00 In this section, the speaker discusses DMA (direct memory access) and how it can be used to transfer memory between the system's main memory and the GPU's memory without the CPU having to copy it. The DMA engine handles this operation in parallel, freeing up resources for the CPU and GPU. The transfer of buffers and images can be done asynchronously and requires special flags. However, data cannot be immediately used after a copy, so the program needs to be restructured accordingly to take advantage of DMA. The speaker suggests double buffering and restructuring loop processes to avoid corrupting currently processed or displayed data. Overall, DMA can greatly improve application performance by offloading CPU and GPU cycles.

  • 00:15:00 In this section, the speaker discusses the double buffering approach used in OpenCL to process and display frames of a video. This approach involves constantly buffering two frames, A and B, and processing one while the other is being uploaded. This eliminates the need for the processing time and the upload time to be added together, and instead only takes the maximum time it takes for either process. The speaker also discusses setting up arguments for the kernel, which only needs to be done once and can be used for all subsequent processing executions if the parameters do not need to be changed.

  • 00:20:00 In this section, the speaker discusses how to execute the kernel and mentions two ways to call the processing, including blocking and non-blocking methods. While blocking enables easier debugging, it is not optimal for performance, so the speaker introduces the option to use events and vectors of events to wait for operations or to set up dependency graphs for multiple operations. By using the event objects to signal the completion of certain work items, it can ensure that downstream operation only starts once those are done, allowing for more efficient processing.

  • 00:25:00 In this section, the speaker explains how to use OpenCL event objects to synchronize operations and how to use event profiling for performance testing. When creating the dependency graph for filters, a pointer to an event is passed as the last argument of the enqueue operation, and the event object itself is created by that enqueue call. This can cause confusion, but it allows for setting up dependencies between filters and upload operations. The speaker describes how, with event profiling, it is possible to get timestamps back from the event for points in its life cycle, such as when the operation the event references was queued, was submitted, started running, and completed running. This enables profiling while keeping all operations asynchronous.
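
A sketch of the profiling pattern described here, assuming the command queue q was created with CL_QUEUE_PROFILING_ENABLE and that the kernel and a two-dimensional global size already exist.

    cl_event ev;
    cl_ulong queued, submitted, started, ended;

    clEnqueueNDRangeKernel(q, kernel, 2, NULL, global, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    /* Timestamps for the four states in the event's life cycle. */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                            sizeof(queued), &queued, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_SUBMIT,
                            sizeof(submitted), &submitted, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(started), &started, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(ended), &ended, NULL);

    printf("kernel ran for %.3f ms\n", (ended - started) * 1e-6);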

  • 00:30:00 In this section of the video, the speaker explains the different states that an event goes through when using OpenCL, such as queued, submitted, running, and completed, and how to use event profile data and timestamps to measure performance and identify potential issues like long execution times for operations or uploads of data. The speaker also discusses how to wait for events to complete in order to ensure the accurate display of video streams with various filters and effects.

  • 00:35:00 In this section, the speaker discusses OpenCL and OpenGL interop, which allows sharing of certain information between the two. This functionality is optional and thus not all implementations must support it. The speaker stresses the importance of checking for the extension and creating a context in OpenCL with certain flags to turn on OpenCL-OpenGL interop. The way this works is by creating an OpenCL image from an OpenGL texture that has already been created, so data is not unnecessarily copied back and forth.

  • 00:40:00 In this section, the speaker explains how OpenCL and OpenGL can share image data through interop. They cover the target, mip level, and texture object that are needed to reference the OpenGL texture. Once created, the OpenCL image can be used like a normal OpenCL image, but the two APIs need to do some handshaking to ensure they don't interfere with each other. The speaker also answers a question on how to create transitions in video rendering, which can be done by using the position of a swipe as an input to the filter. Ultimately, the final result is turned over to OpenGL for display, which completes all of the steps.
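
A sketch of the interop flow, assuming the cl_khr_gl_sharing extension was found, the context ctx was created with the GL sharing properties, and that gl_texture, queue, filter_kernel and a 2D global size already exist; the interop entry points come from CL/cl_gl.h, and clCreateFromGLTexture2D is the OpenCL 1.0/1.1 call used at the time of the webinar.

    cl_int err;
    cl_mem img = clCreateFromGLTexture2D(ctx, CL_MEM_WRITE_ONLY,
                                         GL_TEXTURE_2D, 0 /* mip level */,
                                         gl_texture, &err);

    /* Handshake: GL must be finished with the texture before CL touches it,
       and CL must release it before GL draws with it again. */
    glFinish();
    clEnqueueAcquireGLObjects(queue, 1, &img, 0, NULL, NULL);

    clSetKernelArg(filter_kernel, 0, sizeof(cl_mem), &img);
    clEnqueueNDRangeKernel(queue, filter_kernel, 2, NULL, global, NULL,
                           0, NULL, NULL);

    clEnqueueReleaseGLObjects(queue, 1, &img, 0, NULL, NULL);
    clFinish(queue);                /* now OpenGL can draw the texture */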

  • 00:45:00 In this section, the speaker explains how they process swipes in a video by looking at the time marker of each frame and using a keyframe to interpolate the position of the swipe. They also answer a question about profiling in OpenCL, stating that timestamps come from a high-resolution timer and do not depend on a call to clFinish. Additionally, the speaker discusses the order of execution in OpenCL devices and runtimes and confirms that operations are handled in order on most devices.

  • 00:50:00 In this section, the speaker explores the differences between OpenCL and OpenGL when processing algorithms. The decision of which platform to use depends on individual preferences, although OpenCL may be easier to program due to its comprehensive language structure. In terms of performance, OpenCL may allow processing applications that require more complex hardware; however, there are cases where using OpenGL shading may lead to better performance. Furthermore, while the speaker could not provide any available sample code for the specific application, there are various examples of code in AMD's OpenCL SDK that users can learn from.

  • 00:55:00 In this section, the speaker discusses various samples and SDKs available to developers for OpenCL applications. The samples show how to query devices and run time to get extensions, as well as providing examples of OpenCL OpenGL interoperation. However, there is no sample code currently available for the specific application discussed in the video, but this may change in the future. The webinar has now concluded, and a recording will be made available to attendees.
 

AMD Developer Central: OpenCL Programming Webinar Series. 6. Device Fission Extensions for OpenCL



6 - Device Fission Extensions for OpenCL

In this video, the speaker covers various topics related to device fission extensions for OpenCL. They explain the different types of extensions and how device fission allows large devices to be divided into smaller ones, which is useful for reserving a core for high-priority tasks or ensuring specific work groups are assigned to specific cores. They discuss the importance of retaining sequential semantics when parallelizing vector pushback operations, using parallel patterns to optimize the process, and creating native kernels in OpenCL. The speaker also demonstrates an application that utilizes device fission for OpenCL and discusses memory affinity and the future of device fission on other devices.

  • 00:00:00 In this section, the speaker discusses extensions in OpenCL, specifically the three different types of extensions: KHR extensions, EXT extensions, and vendor extensions. KHR extensions are approved by the OpenCL working group and come with a set of conformance tests. EXT extensions are developed by at least two working group members and do not require conformance tests. Vendor extensions, like CL_AMD_printf, are developed by a single vendor and may only be supported by that vendor. The documentation for all extensions is available on the Khronos OpenCL registry website, allowing for transparency and accessibility across vendors.

  • 00:05:00 In this section, the speaker introduces the device fission extension for OpenCL. This extension allows the user to divide large devices with many compute units into smaller OpenCL devices, either by name or by memory affinity. This division can help in reserving a core for a priority task or ensuring that specific work groups are assigned to specific cores, as in an SMP-based system. The speaker motivates the use of device fission with an example of parallel algorithms and parallel containers built on top of OpenCL, and states that this extension is currently supported by AMD and IBM on their CPUs and Cell Broadband Engine devices.

  • 00:10:00 In this section, the speaker discusses the importance of retaining sequential semantics while parallelizing vector pushback operations. They explain that when inserting elements in a vector during a sequential loop, the expected order would appear as a function of the loop. However, when parallelized, this order is lost, and elements can be inserted out of order, so the speaker proposes retaining the sequential semantics of the vector pushback operation. They then give an example of how this could be essential in an application such as a simplified version of an MPEG-2 stream. They conclude by introducing a C function and discussing how to implement these parallel operations on CPUs.

  • 00:15:00 In this section, the speaker explains how to use parallel patterns to implement the example on top of the device fission extensions for OpenCL. They use the pipeline pattern to execute functions in parallel: each work group reads a block of data from the input into local memory, processes it, and then writes its offset into the mailbox of the corresponding work group, so that output offsets are calculated while the order of the output is preserved. The ordering is achieved through communication between work groups instead of relying on the global ID, which may execute in an arbitrary order. The pipeline pattern ensures that the stages run in parallel.

  • 00:20:00 In this section, the speaker discusses their pipeline and how they're now using the term "counter" as a mailbox, as they've gone beyond just counting things. They explain that they are passing in group IDs for indexing into the mailboxes and using local and global memory for simple calculations. However, they note that there are no guarantees about work-group execution order and explain how this could cause a deadlock situation. To address this issue, the speaker suggests dividing the device into two separate cores using device fission, launching one work group on each core, and thereby guaranteeing progress. They also introduce an important mechanism provided by OpenCL for running natively compiled code on a host device.

  • 00:25:00 In this section, the speaker discusses the benefits of using native kernels in OpenCL, which allow arbitrary C or C++ functions to be run. They offer more flexibility in trying out different implementations, as well as the ability to call standard I/O routines or other library functions that are not available in OpenCL C. Creating a native kernel involves passing a function, its argument block, and a list of memory objects to a command queue, as sketched below. However, caution must be exercised with thread-local storage, as the function may not be executed on the same thread that enqueued it. The speaker also introduces the OpenCL C++ bindings API, which offers some abstractions on top of the C API. The program begins by querying the available platforms and creating a context from a device type.
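
A sketch of the native-kernel mechanism, in which the runtime patches usable pointers for the listed cl_mem objects into the argument block before calling the plain C function; the structure, names and the doubling operation are illustrative, and the device must report CL_EXEC_NATIVE_KERNEL support.

    typedef struct { float *data; int n; } scale_args;

    /* An ordinary, natively compiled C function: it may call printf,
       library routines, or anything else that OpenCL C cannot. */
    static void CL_CALLBACK scale(void *p)
    {
        scale_args *a = (scale_args *)p;
        for (int i = 0; i < a->n; i++)
            a->data[i] *= 2.0f;
    }

    /* Host side: the data slot is left NULL; before calling scale() the
       runtime replaces it with a pointer for the buffer buf. */
    scale_args  args    = { NULL, n };
    const void *locs[1] = { &args.data };

    clEnqueueNativeKernel(queue, scale, &args, sizeof(args),
                          1, &buf, locs, 0, NULL, NULL);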

  • 00:30:00 In this section, the speaker discusses the use of the device fission extensions for OpenCL. Once a valid device type is queried, a list of devices is returned and the first device that supports device fission is picked. Device fission is new to the OpenCL API and is exposed through the extension mechanism, which allows a partition to be described. They use the partition-equally property to split the device into sub-devices of one compute unit each. The sub-device properties are set up and the create-sub-devices function is called. Assuming at least one sub-device is created, mailboxes are made and a command queue is created for each device. The resulting devices behave exactly like any other device and may be used interchangeably with existing libraries. The speaker then moves on to setting up native OpenCL kernels.
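
A sketch of the partitioning step with the cl_ext_device_fission extension; the function and constant names are those defined in CL/cl_ext.h, and it assumes the starting CPU device is in device, that at most 16 sub-devices come back, and that at least two are created.

    /* The extension entry point is fetched at run time. */
    clCreateSubDevicesEXT_fn clCreateSubDevicesEXT =
        (clCreateSubDevicesEXT_fn)
        clGetExtensionFunctionAddress("clCreateSubDevicesEXT");

    /* Partition the device equally into sub-devices of one compute unit. */
    cl_device_partition_property_ext props[] = {
        CL_DEVICE_PARTITION_EQUALLY_EXT, 1, CL_PROPERTIES_LIST_END_EXT
    };

    cl_device_id sub[16];
    cl_uint      n_sub = 0;
    clCreateSubDevicesEXT(device, props, 16, sub, &n_sub);

    /* Sub-devices behave like any other device: build a context over them
       and give each one its own command queue. */
    cl_int err;
    cl_context ctx = clCreateContext(NULL, n_sub, sub, NULL, NULL, &err);
    cl_command_queue q0 = clCreateCommandQueue(ctx, sub[0], 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, sub[1], 0, &err);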

  • 00:35:00 In this section, the speaker discusses the inputs and arguments required for the implementation of the device fission example. The memory buffer for the inputs is divided into four parts, and the corresponding pointers are filled in by the native-kernel runtime. The argument block consists of mailboxes, blocks, and cache transactions, and unique IDs are generated for each kernel. The speaker further explains that each instance of the kernel runs on an individual sub-device, and once all the events have completed, the data is written out with the padding inserted. The kernel itself includes optimizations related to blocking and caching to ensure efficient execution.

  • 00:40:00 In this section, the speaker discusses the implementation of an application that utilizes device fission for OpenCL. They explain how the application works by using various types, such as input/output, mailboxes, local block arrays, block size, and group IDs, to index data sets in parallel. The application also implements simple busy waiting and blocking optimizations to ensure that everything executes as much as possible in parallel. By utilizing device fission, the implementation of this application showcases the potential for achieving speedups on the CPU with little to no ALU operations, which could increase even further with the implementation of wider vectors in the future. The speaker also discusses other applications and use cases for device fission, such as dividing with respect to affinity on NUMA systems.

  • 00:45:00 In this section, the speaker discusses the benefits of memory affinity in OpenCL, which allows for the precise association of buffers with a particular device. This can lead to better cache locality and improved performance by avoiding contention and false sharing. The mailbox scheme used in the example can be extended to support multiple iterations, allowing for a loop that starts the waterfall pipeline again and again. The speaker also mentions the availability of resources in the OpenCL zone on developer.amd.com, where interested parties can find more information about OpenCL, including webinars, past presentations, and an upcoming summit on heterogeneous computing. The speaker also hints at the possibility of supporting device fission on the GPU in the future, which would allow reserving part of the core for high-priority tasks and ensure better performance.

  • 00:50:00 In this section of the video, the speaker discusses the future of device fission moving to other devices. Currently, only AMD and IBM support the device fission extensions for OpenCL, but other vendors have shown interest in the proposal. The question is raised about whether math libraries like BLAS and FFT will be supported, and the speaker confirms that they are working on OpenCL implementations of BLAS, including different variants and implementations for linear algebra, as well as an FFT library.
 

AMD Developer Central: OpenCL Programming Webinar Series. 7. Smoothed Particle Hydrodynamics




7 - Smoothed Particle Hydrodynamics

This video discusses Smoothed Particle Hydrodynamics (SPH), a technique for solving fluid dynamics equations, specifically the Navier-Stokes equations. The video explains the different terms in the equations, including the density, pressure, and viscosity terms, and how they are approximated using a smoothing kernel. The numerical algorithm used for SPH, as well as the use of spatial indexing and Interop, is also discussed. The speaker explains the process of constructing a spatial index and neighbor map and how the physics are computed. The video invites viewers to download and use the program and discusses the limitations of the simulation. The speaker then answers questions from the audience about GPU performance, incompressible behavior, and using cached images.

  • 00:00:00 In this section, Alan Hierich, Senior Member of Technical Staff at AMD, provides an overview of computational fluid dynamics, specifically smoothed particle hydrodynamics (SPH). SPH was originally used for astrophysics calculations, but it has become quite popular in video games and simulations. The technique is used for solving the Navier-Stokes equations, partial differential equations formulated in the 1800s that are the basis of most fluid dynamics work today. Alan explains the definition of fluids and provides an intuitive explanation of how they work, focusing on liquids and gases. He also outlines that fluids are generally described by the Navier-Stokes equations, and that the incompressible Navier-Stokes equations govern fluids like water at normal velocities and temperatures.

  • 00:05:00 In this section, the speaker explains the equations that govern fluids, known as the Navier-Stokes equations. The equation of motion represents the change in velocity as a function of gravity, pressure, and viscosity, while the mass continuity equation states that mass is neither created nor destroyed. The phenomena that govern a fluid are gravity, pressure, and velocity, and the viscosity is the stickiness of the fluid, which determines the likelihood of fluid particles traveling in the same direction. The convective acceleration term is also discussed, which describes the acceleration of a fluid as it moves through a smaller opening, such as a nozzle on a garden hose. The speaker invites the audience to download and play with the program that simulates fluids in a box that was demonstrated.

  • 00:10:00 In this section, the speaker explains the different terms in the equation of motion for fluid dynamics, including convective acceleration, gradient of pressure, and viscosity. The pressure is defined as the difference between the actual density of the fluid at a point and the resting density. The viscosity term, which is the last term on the right-hand side of the equation of motion, diffuses the momentum of the system, ultimately resulting in a state where the velocity is equivalent at all locations. There is also a mass continuity equation, del dot V equals 0, which implies that mass is neither created nor destroyed in the incompressible equations. Lastly, to represent the dynamics of the system, the speaker takes the material derivative of the equation to get the equation of motion for a particle.
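
For reference, the incompressible Navier-Stokes equations the talk works from can be written (in LaTeX notation) as:

    \frac{D\mathbf{v}}{Dt}
      = \underbrace{\mathbf{g}}_{\text{gravity}}
      \;-\; \underbrace{\frac{1}{\rho}\nabla p}_{\text{pressure}}
      \;+\; \underbrace{\nu\,\nabla^{2}\mathbf{v}}_{\text{viscosity}},
    \qquad
    \nabla\cdot\mathbf{v} = 0,
    \qquad
    \frac{D}{Dt} \equiv \frac{\partial}{\partial t} + \mathbf{v}\cdot\nabla .

The material derivative D/Dt carries the convective acceleration term mentioned above, and the divergence-free condition is the mass continuity equation.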

  • 00:15:00 In this section, the video discusses the smoothed particle hydrodynamics (SPH) technique for solving the simplified incompressible Navier-Stokes equations introduced in the previous section. The SPH technique was first introduced in 1992 for studying astrophysics and galaxies, but can be used for fluid equations as well. It involves introducing smoothing kernel representations of quantities, which are like basis functions that allow us to approximate any quantity by a summation of the quantity at nearby points multiplied by a weighting function. The SPH technique is used to approximate density and pressure gradient terms in the Navier-Stokes equations. The video also mentions that the Navier-Stokes equations are numerically sensitive to scale, and that calculations are done at a smaller scale before being expanded to the regular spatial scale in order to move the particles around in space.

  • 00:20:00 In this section, the speaker explains the three main terms that are approximated in Smoothed Particle Hydrodynamics (SPH): the density, pressure, and viscosity terms. To calculate the density, the program sums the masses of nearby particles weighted by a smoothing kernel. The pressure term is then calculated using the scalar pressure divided by density, multiplied by the gradient of the smoothing kernel. The viscosity term is approximated using a scalar coefficient that determines the level of fluid viscosity and the difference in velocity between two points divided by the density of particle j. The speaker also explains the properties of the smoothing kernel used in the SPH simulation, and how it sums to one over a sphere of radius h. The standard forms of these approximations are sketched below.
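
The three SPH approximations described here are commonly written (for example, following Müller et al.) as below, in LaTeX notation; the exact symmetrisation used in the webinar's code may differ slightly.

    \rho_i = \sum_j m_j\, W(\mathbf{r}_i - \mathbf{r}_j, h)

    \mathbf{f}^{\text{pressure}}_i
      = -\sum_j m_j\, \frac{p_i + p_j}{2\rho_j}\, \nabla W(\mathbf{r}_i - \mathbf{r}_j, h)

    \mathbf{f}^{\text{viscosity}}_i
      = \mu \sum_j m_j\, \frac{\mathbf{v}_j - \mathbf{v}_i}{\rho_j}\, \nabla^{2} W(\mathbf{r}_i - \mathbf{r}_j, h)

Here W is the smoothing kernel with support radius h, the m_j are the particle masses, and the pressure p is obtained from the density, for example as p = k(rho - rho_0), matching the definition of pressure given above.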

  • 00:25:00 In this section, the speaker discusses the numerical algorithm used for Smoothed Particle Hydrodynamics (SPH). The algorithm involves computing density, pressure, pressure gradient, viscous term, and acceleration, which are then used to time-step the velocities and positions of the particles. The speaker explains that the initial algorithm involves testing the interactions of all particles against all particles, which is correct but not fast enough. Hence, a better algorithm is introduced, which divides space into voxels, allowing interactions with only particles within the interaction radius. Additionally, a subset of particles is randomly chosen to compute interactions instead of considering all particles, which produces an efficient program.
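
A minimal sketch of the voxel idea in OpenCL C; the buffer names, grid parameters, and hashing scheme below are illustrative assumptions, not the demo's actual code:

    // One work item per particle: map its position to a voxel id so that
    // neighbor searches only need to look at the surrounding voxels.
    __kernel void hash_particles(__global const float4 *position,   // particle positions
                                 __global uint         *voxel_id,   // output: voxel per particle
                                 const float4           grid_origin,// lower corner of the box
                                 const float            cell_size,  // voxel edge ~ interaction radius
                                 const uint4            grid_dim)   // voxels per axis
    {
        uint i = get_global_id(0);
        // Positions are assumed to stay inside the box (the boundary
        // conditions enforce this), so no clamping is done here.
        float4 p = position[i] - grid_origin;
        uint x = (uint)(p.x / cell_size);
        uint y = (uint)(p.y / cell_size);
        uint z = (uint)(p.z / cell_size);
        // Linearize (x, y, z) into a single voxel index; particles are later
        // sorted by this id so that each voxel's particles are contiguous.
        voxel_id[i] = (z * grid_dim.y + y) * grid_dim.x + x;
    }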

  • 00:30:00 In this section, the speaker discusses the use of spatial indexing to compute interactions with only a limited number of particles in OpenCL simulations, as well as the importance of using Interop to share data buffers with the graphics system. Interop allows rendering directly from the GPU buffer and saves space in graphics memory; without it, the program has to copy data to host memory and back, significantly slowing down the simulation. The speaker explains the modifications necessary to use Interop, including creating a different context, and introduces the set of buffers needed for the simulation, including a particle index used for sorting. Despite discussing the importance of Interop, the program being shown does not use it, which slows down the simulation.
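
As a rough illustration of the Interop path described here, using the cl_khr_gl_sharing extension: the context must be created with the GL sharing properties (CL_GL_CONTEXT_KHR plus the platform-specific display/HDC property), which is the "different context" mentioned above. The fragment below is a hedged sketch; the variable names and the pre-existing GL vertex buffer are assumptions, and error handling is omitted.

    // Wrap an existing OpenGL vertex buffer so OpenCL kernels can write
    // particle positions directly into graphics memory (no host round trip).
    cl_int err;
    cl_mem cl_positions = clCreateFromGLBuffer(context,          // context created with GL sharing properties
                                               CL_MEM_READ_WRITE,
                                               gl_position_vbo,  // assumed: an existing GLuint VBO
                                               &err);

    // Per frame: acquire the buffer for OpenCL, run the simulation kernels,
    // release it back to OpenGL, then render from the same VBO.
    clEnqueueAcquireGLObjects(queue, 1, &cl_positions, 0, NULL, NULL);
    /* ... enqueue hash/sort/index/neighbor/physics kernels here ... */
    clEnqueueReleaseGLObjects(queue, 1, &cl_positions, 0, NULL, NULL);
    clFinish(queue);  // make sure OpenCL is done before OpenGL draws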

  • 00:35:00 In this section, the speaker discusses the various kernels used in the smoothed particle hydrodynamics algorithm. The first kernel is "hash particles," which associates each particle with a voxel. Then, the "sort" and "sort post pass" kernels sort the particles into voxels and organize them for the spatial index construction. Next, the "index" and "index post pass" kernels construct a spatial index from voxels to particles. After that, the "find neighbors" kernel decides which neighbors will interact with each other. Finally, the "compute density pressure," "compute acceleration," and "integrate" kernels compute the interactions and physics between particles. The speaker explains that a radix sort is used in the GPU version of the code, while qsort is used in the CPU version.
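
On the host side, one simulation step would be enqueued roughly as below. This is a hedged sketch: the kernel and size variables are illustrative, the argument setup is omitted, and in practice the radix sort is itself several kernel launches.

    // One simulation step: each stage is enqueued on the same in-order
    // command queue, so no explicit event dependencies are needed.
    size_t n_particles = NUM_PARTICLES;   // assumed global work sizes
    size_t n_voxels    = NUM_VOXELS;

    clEnqueueNDRangeKernel(queue, k_hash_particles,   1, NULL, &n_particles, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, k_sort,             1, NULL, &n_particles, NULL, 0, NULL, NULL); // radix sort on GPU
    clEnqueueNDRangeKernel(queue, k_sort_post_pass,   1, NULL, &n_particles, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, k_index,            1, NULL, &n_voxels,    NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, k_index_post_pass,  1, NULL, &n_voxels,    NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, k_find_neighbors,   1, NULL, &n_particles, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, k_density_pressure, 1, NULL, &n_particles, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, k_acceleration,     1, NULL, &n_particles, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, k_integrate,        1, NULL, &n_particles, NULL, 0, NULL, NULL);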

  • 00:40:00 In this section, the video explains the process of constructing a spatial index from voxels to particles in Smoothed Particle Hydrodynamics (SPH). The kernel uses a binary search to identify the lowest numbered particle in each voxel and leaves a value of negative one in voxels that do not contain particles. The index post-pass then fills in an index value for empty voxels by copying the value of the next non-empty voxel in the grid cell index. Once the indexing is done, the program constructs a neighbor map by searching through the local region of two-by-two-by-two voxels that surround each particle. To eliminate bias, the program generates a random offset into each voxel and alternates the direction of the search. The kernel then selects the first 32 particles within the interaction radius and adds them to the neighbor map.
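
A sketch of how such an index kernel might look in OpenCL C, assuming the particles have already been sorted by voxel id; the names and data layout are illustrative, not the demo's actual code:

    // One work item per voxel: binary-search the sorted voxel ids for the
    // first particle in this voxel, or write -1 if the voxel is empty.
    __kernel void build_index(__global const uint *sorted_voxel_id, // voxel id per particle, ascending
                              __global int        *voxel_start,     // output: first particle per voxel
                              const uint           n_particles)
    {
        uint v = get_global_id(0);

        // Lower-bound binary search for voxel id v.
        uint lo = 0, hi = n_particles;
        while (lo < hi) {
            uint mid = (lo + hi) / 2;
            if (sorted_voxel_id[mid] < v) lo = mid + 1;
            else                          hi = mid;
        }

        // lo is the first position with id >= v; it counts only if it is exactly v.
        if (lo < n_particles && sorted_voxel_id[lo] == v)
            voxel_start[v] = (int)lo;
        else
            voxel_start[v] = -1;   // empty voxel; the post pass fills this in
    }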

  • 00:45:00 In this section, the speaker explains how they computed the physics by constructing a neighbor map, which allows them to interact with 32 particles. They go over the equations used to approximate density and pressure, compute acceleration terms, and then combine everything to determine total acceleration. The velocity and position are then advanced through numerical integration, with boundary conditions in place to prevent particles from escaping the box. The speaker encourages viewers to download and play with the source code, and emphasizes that while there are many slow methods for solving the Navier-Stokes equations, slow doesn't necessarily mean good.

  • 00:50:00 In this section of the video, the speaker explains the update step of the simulation, where the velocity is integrated to its new value, which is then used to update the position. The velocity update is explicit, while the position update is semi-implicit, using the value of the velocity at the next time step. The simulation uses single-precision floats for performance; if high fidelity and accuracy were needed, double precision would be recommended. The algorithm is fully parallelizable, but the trade-offs of spatial partitioning need to be taken into account. Finally, the speaker answers questions about using Interop with multiple GPUs, simulating turbulence, and the practical maximum number of particles in the simulation.
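
The update described here is the usual semi-implicit (symplectic) Euler step, written in standard notation:

    \mathbf{v}_{t+\Delta t} = \mathbf{v}_{t} + \mathbf{a}_{t}\,\Delta t, \qquad \mathbf{x}_{t+\Delta t} = \mathbf{x}_{t} + \mathbf{v}_{t+\Delta t}\,\Delta t

The velocity update is explicit, while the position update uses the freshly computed velocity, which is what makes it semi-implicit.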

  • 00:55:00 In this section, the speaker answers some questions from the audience. They explain that the practical limit on the simulation is the achievable performance rate, which depends on the class of GPU and CPU used. They also mention that although the simulation is based on the incompressible equations, it does not explicitly enforce the incompressibility condition, so the fluid may exhibit some compressible behavior. The speaker also answers a question about why they used buffers instead of cached image memory, stating that at the time the program was developed they saw no performance advantage to using cached images. However, they mention that OpenCL will provide support for cached buffers in the future and that they might change the program to take advantage of them. Overall, the speaker invites the audience to download and use the program in any way they want, as there are no restrictions on its use.
 

AMD Developer Central: OpenCL Programming Webinar Series. 8. Optimization Techniques: Image Convolution



8 - Optimization Techniques: Image Convolution

In this video, Udeepta D. Bordoloi discusses optimization techniques in image convolution.

 

AMD Developer Inside Track: How to Optimize Image Convolution



How to Optimize Image Convolution

This video discusses various methods for optimizing image convolution, including using the local data share (LDS), optimizing constants, and using larger local areas to improve efficiency. The speaker emphasizes the importance of minimizing processing time in image convolution to improve overall performance and highlights a method of reusing data through local memory. The video suggests optimization steps such as choosing between buffers and textures, using float4 data, and passing build options to the compiler. A step-by-step article about optimizing image convolution is available on the AMD developer website.

  • 00:00:00 In this section, Udeepta Bordoloi from AMD's graphics team explains the concept of image convolution, which involves computing a weighted sum over an area of the input image to generate each output pixel. He uses OpenCL and a 5870 GPU for the optimization work, gradually working up from the basic correlation code. Mirroring and using LDS (Local Data Share) are some of the optimization methods used, resulting in a significant reduction in execution time.
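
As a baseline for the optimizations that follow, an unoptimized convolution kernel in OpenCL C might look like this. It is a sketch under the assumptions of a single-channel float image and an odd filter width; it is not the code shown in the video.

    // Each work item computes one output pixel as a weighted sum over a
    // filterWidth x filterWidth neighborhood of the input image.
    __kernel void convolve_naive(__global const float *input,
                                 __global float       *output,
                                 __constant float     *filter,      // filterWidth * filterWidth weights
                                 const int             imageWidth,
                                 const int             imageHeight,
                                 const int             filterWidth) // odd, e.g. 5
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        int r = filterWidth / 2;
        float sum = 0.0f;

        for (int fy = -r; fy <= r; fy++) {
            for (int fx = -r; fx <= r; fx++) {
                // Clamp to the image border (one simple way to handle edges).
                int ix = min(max(x + fx, 0), imageWidth  - 1);
                int iy = min(max(y + fy, 0), imageHeight - 1);
                sum += input[iy * imageWidth + ix] *
                       filter[(fy + r) * filterWidth + (fx + r)];
            }
        }
        output[y * imageWidth + x] = sum;
    }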

  • 00:05:00 In this section, the speaker discusses optimizing an image convolution program that works for all filter sizes and all input sizes. He focuses on using local memory to improve data sharing and reduce time spent waiting on global memory. By treating the constants and inputs as 128-bit (float4) values, the compiler can handle the code more efficiently and processing time is reduced. He shows how optimizing the constants and using larger local areas can greatly improve the efficiency of image convolution. Overall, the speaker emphasizes the importance of finding ways to minimize processing time in image convolution to improve overall performance.

  • 00:10:00 In this section, the speaker discusses how to optimize image convolution by parameterizing the filter size so it can be changed for different masks. The speaker notes that applying an optimization at different stages can affect performance differently, and that measuring after each step helps in finding any unexpected issues. The speaker also discusses running the same number of work items on a 2k by 2k input image with float4 data, which resulted in a more efficient data format. Additionally, the speaker highlights a method of reusing data, instead of re-fetching it from global memory, by using local memory, known as LDS in the hardware.
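
A rough sketch of the LDS-reuse idea in OpenCL C. The tile and radius sizes are assumptions, image borders are ignored to keep the sketch short, and this is not the video's actual code.

    // Tiled variant: each work group stages the pixels it will re-read into
    // local memory (LDS) once, then all work items read from LDS instead of
    // global memory.
    #define TILE   16       // work-group size in each dimension (assumed)
    #define RADIUS 2        // filter radius, i.e. filterWidth = 5

    __kernel void convolve_lds(__global const float *input,
                               __global float       *output,
                               __constant float     *filter,
                               const int             imageWidth)
    {
        __local float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

        int lx = get_local_id(0),  ly = get_local_id(1);
        int gx = get_global_id(0), gy = get_global_id(1);

        // Cooperatively load the (TILE + 2*RADIUS)^2 tile, each work item
        // loading one or more elements in a strided loop.
        for (int ty = ly; ty < TILE + 2 * RADIUS; ty += TILE)
            for (int tx = lx; tx < TILE + 2 * RADIUS; tx += TILE)
                tile[ty][tx] = input[(gy - ly + ty - RADIUS) * imageWidth +
                                     (gx - lx + tx - RADIUS)];

        barrier(CLK_LOCAL_MEM_FENCE);   // wait until the whole tile is in LDS

        float sum = 0.0f;
        for (int fy = 0; fy < 2 * RADIUS + 1; fy++)
            for (int fx = 0; fx < 2 * RADIUS + 1; fx++)
                sum += tile[ly + fy][lx + fx] * filter[fy * (2 * RADIUS + 1) + fx];

        output[gy * imageWidth + gx] = sum;
    }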

  • 00:15:00 In this section, the speaker talks about further optimizing the image convolution. They load all the input needed by a particular work group into LDS and then work out of that local copy. They then move away from LDS and try using textures, which rely on the texture cache. They run an experiment with a 2K by 2K image and different filter sizes and get the fastest numbers when using textures. The suggested optimization steps are choosing between buffers and textures, using float4 data, passing build options to the compiler, and making use of the cache where possible. They have published a step-by-step article about optimizing image convolution techniques on the AMD developer website, which is linked next to the video.
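
The texture path compared here reads the input through an image object so that fetches go through the texture cache and out-of-range coordinates are clamped by the sampler. A minimal hedged fragment (again, not the video's code):

    // Reading the input as an image2d_t routes fetches through the texture
    // cache; the sampler clamps coordinates at the image edge.
    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                               CLK_ADDRESS_CLAMP_TO_EDGE   |
                               CLK_FILTER_NEAREST;

    __kernel void convolve_image(__read_only image2d_t input,
                                 __global float       *output,
                                 __constant float     *filter,
                                 const int             imageWidth,
                                 const int             filterWidth)
    {
        int x = get_global_id(0), y = get_global_id(1);
        int r = filterWidth / 2;
        float sum = 0.0f;

        for (int fy = -r; fy <= r; fy++)
            for (int fx = -r; fx <= r; fx++)
                sum += read_imagef(input, smp, (int2)(x + fx, y + fy)).x *
                       filter[(fy + r) * filterWidth + (fx + r)];

        output[y * imageWidth + x] = sum;
    }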
 

AMD Developer Central: OpenCL Technical Overview. Introduction to OpenCL



Introduction to OpenCL

In this video, Michael Houston provides an overview of OpenCL, an industry standard for data parallel computation targeting multi-core CPUs, mobile devices, and other forms of silicon. OpenCL aims to unify previously competing proprietary implementations such as CUDA and Brook+, which will simplify development for independent software vendors. It offers a split between code that runs on the device and code that manages the device, using a queuing system designed with feedback from game developers. OpenCL is designed to work well with graphics APIs, creating a ubiquitous computing language that can be used for applications such as photo and video editing, as well as for artificial intelligence systems, modeling, and physics. The presenter also discusses the use of OpenCL for Hollywood rendering and hopes to see more work in this area.

  • 00:00:00 In this section, Mike Houston explains the purpose of OpenCL as an industry standard for data parallel computation, targeting multi-core CPUs, mobile devices, and other forms of silicon. OpenCL aims to unify previously competing proprietary implementations such as CUDA and Brook+, which will simplify development for independent software vendors. Although OpenCL is a different dialect with minor differences, transitioning from other data parallel languages such as CUDA is direct and fast. OpenCL also offers a split between code that runs on the device and code that manages the device, using a queuing system designed with feedback from game developers. It is designed to work well with graphics APIs, creating a ubiquitous computing language for the consumer space in applications such as photo and video editing.
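
For readers new to the API, the host/device split described here comes down to a small amount of host boilerplate. The program below is a minimal hedged sketch (error handling omitted, kernel name and sizes chosen only for illustration), not code from the talk.

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        // Kernel source: each work item writes its own global id.
        const char *src =
            "__kernel void fill(__global float *out) {"
            "    out[get_global_id(0)] = (float)get_global_id(0);"
            "}";
        size_t n = 16;
        float result[16];

        // Host-side flow: pick a device, build a program, and queue work.
        cl_platform_id platform;  clGetPlatformIDs(1, &platform, NULL);
        cl_device_id   device;    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        cl_kernel  kernel  = clCreateKernel(program, "fill", NULL);

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

        // One work item per element; the runtime picks the work-group size.
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), result, 0, NULL, NULL);

        printf("result[5] = %f\n", result[5]);
        return 0;
    }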

  • 00:05:00 In this section, the speaker explains some of the initial uses of OpenCL, which includes processing images or high-res video and running virus scanners. Additionally, the technology is useful for artificial intelligence systems, modeling systems, physics, post-processing, and accelerating lighting and rendering for movies. The speaker hopes to see more work on using OpenCL for Hollywood rendering, among other things.
 

AMD Developer Central: Episode 1: What is OpenCL™?



Episode 1: What is OpenCL™?

This video provides an introduction to OpenCL and its design goals, which focus on leveraging various processors to accelerate parallel computations instead of sequential ones. OpenCL enables the writing of portable code for different processors using kernels, global and local dimensions, and work groups. Work items and work groups can collaborate by sharing resources, but synchronization between work items in different work groups is not possible. Optimal problem dimensions vary for different types of processing, and it’s important to choose the best dimensions for the best performance. OpenCL can fully utilize a system's capabilities through expressing task and data parallelism together using the OpenCL event model.

  • 00:00:00 In this section, Justin Hensley discusses the basics of OpenCL and its design goals, which focus on leveraging CPUs, GPUs, or other processors, such as the Cell broadband engine or DSPs, to accelerate parallel computations rather than sequential ones, resulting in dramatic speedups. OpenCL enables writing portable code that runs on any supported processor type, such as AMD CPUs and GPUs. Work is expressed through kernels, which are similar to C functions and are used to exploit parallelism, and programs, which are collections of kernels and other functions; applications execute kernel instances via queues, either in order or out of order. The global and local dimensions in OpenCL define the range of the computation. The whole point of OpenCL is to use highly parallel devices to accelerate computation, so work items within a local work group can collaborate by sharing resources, while global work items must be independent, with synchronization only possible within a work group.

  • 00:05:00 In this section, we learn about work items and workgroups, and how OpenCL allows us to synchronize between work items within a workgroup using barriers or memory fences. However, work items in different workgroups can't synchronize with each other. The optimal problem dimensions also vary for different types of processing, and it’s important to choose the best dimensions for the given problem to get the best performance. In OpenCL, it's also possible to express task parallelism by executing a single work item as a task using the OpenCL event model. By allowing task and data parallelism to work together, OpenCL can fully utilize the system's capability.
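
As a small illustration of the work-item and work-group model described in this episode, the hedged kernel below (not from the video) uses local memory and a barrier to synchronize within a work group; there is intentionally no way to synchronize across work groups.

    // Each work group loads its slice of 'in' into local memory, synchronizes,
    // then each work item reads its neighbor's value from the shared copy.
    __kernel void neighbor_sum(__global const float *in,
                               __global float       *out,
                               __local  float       *scratch)   // sized to the work-group size by the host
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        scratch[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);    // all work items in this group have written

        int next = (lid + 1) % lsz;      // neighbor within the same work group only
        out[gid] = scratch[lid] + scratch[next];
    }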