OpenCL in trading - page 6

 

16. OpenCL Example: Image Rotation



This video discusses image rotation and how it can be implemented using OpenCL. Each pixel in an image has coordinates and represents a specific color value, and rotation moves each pixel to a new location based on a formula that takes into account its original coordinates and the rotation angle. The speaker proposes assigning each work item to calculate the new position of a single pixel and uses input decomposition to divide the whole global work space into smaller work groups, making the operation more efficient. The process of transferring the image from the buffer on the device back to the buffer on the host is also explained, with emphasis on checking for errors and calculating elapsed time.

  • 00:00:00 In this section, the video discusses image rotation and the math behind it. It explains that each pixel in an image has coordinates and represents a specific color scale. The rotation involves moving the pixels to a new location based on a formula that takes into account their original and new coordinates and the rotation angle. The video proposes assigning each work item to calculate the new position of a single pixel and uses input decomposition to divide the whole global workspace into smaller work groups, making the operation more efficient.
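
The coordinate mapping described above can be sketched in plain Python. This is a CPU sketch of the per-pixel work-item logic, not the actual kernel from the video; the function name is illustrative:

```python
import math

def rotate_pixel(x, y, cx, cy, theta):
    """Map a pixel (x, y) to its new location when the image is
    rotated by `theta` radians around the center (cx, cy)."""
    # Standard 2D rotation applied to the pixel's offset from the center.
    dx, dy = x - cx, y - cy
    new_x = cx + dx * math.cos(theta) - dy * math.sin(theta)
    new_y = cy + dx * math.sin(theta) + dy * math.cos(theta)
    return round(new_x), round(new_y)

# A 90-degree rotation about the center of a 100x100 image.
print(rotate_pixel(70, 50, 50, 50, math.pi / 2))  # (50, 70)
```

In the OpenCL kernel, each work item would execute this computation for a single pixel and then copy the pixel value to the new location after boundary checking.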

  • 00:05:00 In this section, the speaker explains how the image is divided into work groups in OpenCL to implement image rotation. The image is tiled along its horizontal and vertical dimensions into work groups of 16×16 work items, assuming the width and height of the image are multiples of 16. The speaker then introduces the kernel function used to perform the image rotation, which takes arguments such as the original and destination data, image dimensions, and rotation parameters. Within the kernel function, the speaker illustrates how the new location of each pixel is calculated from the rotation parameters and how pixel content is copied from the original location to the new location after boundary checking.
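
Assuming the 16×16 work-group size described here, the decomposition of the global work space amounts to the following (a sketch with illustrative image dimensions):

```python
def work_group_grid(width, height, local_size=16):
    """Number of work groups along each dimension when a global NDRange
    of width x height work items is tiled into local_size x local_size
    work groups. Assumes both dimensions are multiples of local_size."""
    assert width % local_size == 0 and height % local_size == 0
    return width // local_size, height // local_size

# A 512x256 image tiled into 16x16 work groups.
print(work_group_grid(512, 256))  # (32, 16)
```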

  • 00:10:00 In this section, the speaker explains how to rotate an image using OpenCL. The process involves checking the coordinates and ensuring that they are positive values within the dimensions of the image. Pixel information is then copied from the original location to the new location using calculations to determine the physical memory location. The example code listed in this section shows the differences between this implementation and the previous one of matrix multiplication. The C++ binding for the OpenCL API is used to set up the environment to query the platforms, acquire the device, create a command queue, and declare buffers for moving data. The kernel is compiled, and the parameters are set to run the kernel, which is completed by reading the results back to the host.

  • 00:15:00 In this section, the speaker discusses the process of transferring an image from the buffer on the device to the buffer on the host. The speaker explains that this process involves reading the buffer on the device using the clEnqueueReadBuffer function and specifying the size, offset, and pointer to the buffer on the host. Additionally, the speaker notes the importance of checking for errors during this process using an if statement, and demonstrates how to calculate the elapsed time for this portion of the code using the clGetEventProfilingInfo function.
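
clGetEventProfilingInfo reports timestamps in nanoseconds, so the elapsed-time calculation mentioned above reduces to a subtraction and a unit conversion (the counter values below are made up for illustration):

```python
def elapsed_ms(start_ns, end_ns):
    """Elapsed time between the CL_PROFILING_COMMAND_START and
    CL_PROFILING_COMMAND_END counters, converted to milliseconds."""
    return (end_ns - start_ns) / 1e6

# Hypothetical counters read from the read-buffer event.
print(elapsed_ms(1_000_000, 4_500_000))  # 3.5
```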
OpenCL Example: Image Rotation
  • 2020.06.05
  • www.youtube.com
This video explains the math behind the image rotation and the code implementation.
 

17. OpenCL Example Image Rotation Demo



The "OpenCL Example Image Rotation Demo" tutorial covers the source code of the demo, which includes different folders containing C code and image files that the program will process. The video walks through creating buffers for input and output images, copying the original image to the device buffer, setting kernel arguments, executing the kernel with a global size defined as the whole image, and reading the output data back to the host. The kernel function takes rotation parameters to calculate the new coordinates of each pixel and copy the pixel information to the new location with boundary checking. The program also includes a function to store the rotated image in BMP format and frees all resources after completion. The demo successfully reads and computes the pixels of the original image to create the rotated image.

  • 00:00:00 In this section, the presenter gives an overview of the source code for the image rotation demo which includes different folders containing the main and supporting C code files, as well as the image files that the program will process. The code includes header files for Mac OS and Altera OpenCL platforms, definitions for buffers and kernel parameters, and utilises supporting functions for opening image files and converting their formats. The code also includes the creation of the output buffer and the initialisation with random numbers. Finally, the code creates a command queue, program object, and kernel object, and specifies the kernel function name.

  • 00:05:00 In this section, the speaker walks through the process of creating buffers for input and output images, copying the original image to the device buffer, setting kernel arguments, executing the kernel with a global size defined as the whole image, and reading the output data back to the host. The kernel function takes in destination and source buffer pointers, image dimensions, and rotation parameters to calculate the new coordinates of each pixel and copy the pixel information to the new location with boundary checking. The program also includes a function to store the rotated image in BMP format and frees all resources after completion.

  • 00:10:00 In this section, the speaker demonstrates the OpenCL image rotation demo on the Altera FPGA platform, which takes the original image of a cat and rotates it clockwise by 45 degrees, producing a new image of the same size that is saved with a new name in the image rotation folder. The demo shows that it successfully reads and computes the pixels of the original image to create the rotated image.
OpenCL Example Image Rotation Demo
  • 2020.06.05
  • www.youtube.com
This video walks through the code of Image Rotation and demonstrates the results.
 

18. OpenCL Example: Image Convolution



The "OpenCL Example: Image Convolution" video explains image convolution, which modifies each pixel in an image using information from its neighboring pixels by applying a filter such as a blurring filter. The video provides a serial implementation of the image convolution function and introduces the "image" data structure in OpenCL, which is designed for image data types and allows for efficient processing on graphics processors. The video shows how to copy image and filter data to the device for the image convolution work using OpenCL and the use of the OpenCL sampler object to access the image. The video also demonstrates how to obtain the work item and iterate through the filter rows and columns to obtain pixel information from the image object, multiply it with the filter pixels, and accumulate the products into the sum variable. Lastly, the video shows how to read the updated pixel values back from an OpenCL image object.

  • 00:00:00 In this section, the video discusses image convolution, which is an operation that modifies the value of each pixel in an image using information from its neighboring pixels. This is done by applying a filter to the original image, such as a blurring filter, which takes a weighted average of neighboring pixels to reduce the differences between their values. The video demonstrates how to apply a 3x3 filter to a small region of an image and use element-wise multiplication and summation to calculate the value of each new pixel in the filtered image. However, the video notes that when applying such filtering operations to an entire image, careful consideration must be given to boundary pixels that do not have all eight neighboring pixels. The video also shows several example filters, such as blurring and edge detection, that can be applied to an image to achieve different effects.
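
The element-wise multiply-and-sum with boundary handling described above can be sketched in plain Python. This is an illustrative CPU sketch, not the video's code; it clamps out-of-range coordinates, one of several ways to handle the boundary pixels mentioned:

```python
def convolve_pixel(image, x, y, filt):
    """Apply a square filter centered at (x, y), clamping neighbor
    coordinates at the image boundary so edge pixels reuse their
    nearest in-range neighbor."""
    h, w = len(image), len(image[0])
    half = len(filt) // 2
    total = 0.0
    for fy in range(len(filt)):
        for fx in range(len(filt)):
            # Clamp neighbor coordinates to stay inside the image.
            ix = min(max(x + fx - half, 0), w - 1)
            iy = min(max(y + fy - half, 0), h - 1)
            total += image[iy][ix] * filt[fy][fx]
    return total

# A 3x3 blurring filter: a uniform weighted average of the neighborhood.
blur = [[1 / 9] * 3 for _ in range(3)]
img = [[9, 9, 9],
       [9, 0, 9],
       [9, 9, 9]]
print(convolve_pixel(img, 1, 1, blur))  # approximately 8.0
```

The blurred center pixel moves toward the average of its neighborhood, which is exactly the smoothing effect the video describes.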

  • 00:05:00 In this section of the "OpenCL Example: Image Convolution" video, the speaker provides a serial implementation of the image convolution function that goes through every pixel of an image and applies a filter to it. They also introduce a new data structure in OpenCL called "image" which is specifically designed for image data types, allowing for instruction sequences that can be more efficiently processed on graphics processors. Image objects can be created with specified formats, dimensions, and other parameters. The video demonstrates how to create a 2D image buffer on the device side.

  • 00:10:00 In this section, the video discusses the process of copying image and filter data to the device for the image convolution work using OpenCL. The section also introduces the OpenCL sampler object, which describes how to access an image: the addressing mode, the filtering mode, and whether normalized coordinates are used. The section also shows the kernel function and the use of the read_only and constant qualifiers to specify object properties, which enables the OpenCL runtime to place the filter object in the constant region of global memory.

  • 00:15:00 In this section of the OpenCL image convolution example, the speaker goes through the steps of obtaining the work item using get_global_id, calculating the half-width of the filter, initializing the filter index, iterating through the filter rows and columns, and using the read_image function to obtain the pixel information from the image object. The pixel information is then multiplied with the pixel in the filter and accumulated to the sum variable. This process is repeated for every pixel in the filter and allows for the use of neighboring pixel information in the image convolution.

  • 00:20:00 In this section, the video tutorial demonstrates how to read the updated pixel values back from an OpenCL image object. After successfully running the kernel, the next step is to call clEnqueueReadImage to read the image back. This function takes the command queue as the first argument, the output image object, and a true value to make the read blocking until completion. The origin and region parameters used when creating the image object are provided, along with the buffer on the host side where the image data is stored.
OpenCL Example: Image Convolution
  • 2020.06.07
  • www.youtube.com
This video introduces the principles of image convolution and how to implement it in OpenCL.
 

19. Demo: OpenCL Example - Image Convolution



The video explains an OpenCL image convolution example, defining different filters such as blur, sharpen, edge-sharpen, edge-detection, and emboss filters. The presenter demonstrates initializing the filter values and reading BMP image data from the file, creating input and output image objects, and setting up kernel arguments to execute the kernel. The video also teaches creating the sampler, defining how to process pixels outside the boundary, launching the kernel, storing the pixel data into a file, and creating the necessary headers for the BMP format. Finally, the results are verified by comparing the values in two buffers: the filtered image should match the golden result with only a slight deviation due to floating-point computation.

  • 00:00:00 In this section of the video, the presenter walks through the code of an OpenCL image convolution example. The code defines different filters, including blur, sharpen, edge-sharpen, edge-detection, and emboss filters. The presenter demonstrates how to test out a specific filter, in this case, the edge-detection filter. They also explain how to initialize the filter values and read the BMP image data from the file, create the input and output image objects, and set up the kernel arguments to execute the kernel. The video also covers how to create the sampler and define how to process pixels that fall outside the boundary.

  • 00:05:00 In this section, the presenter demonstrates an example of image convolution using OpenCL. The presentation shows the necessary steps to perform the filter on the host, including launching the kernel, storing the pixel data into a file and creating the necessary headers for the BMP format, and verifying the results by comparing the values in two buffers. The objective of this exercise is to create a filtered image that should match the golden result with only a slight deviation due to floating-point computation. Overall, the presentation emphasizes how to compile, run, and verify the output from the OpenCL kernel function.
Demo: OpenCL Example - Image Convolution
  • 2020.06.07
  • www.youtube.com
This video demonstrates the Image Convolution example.
 

20. Lecture 5 OpenCL Concurrency Model



This lecture covers the OpenCL runtime and concurrency model, including multiple command queues, the queuing model, OpenCL kernels, work items, and work groups. Synchronization points are used to manage the execution of commands, and wait events are used to synchronize the commands in a device-side command queue. The lecture emphasizes the importance of asynchronous operations in OpenCL and explains the use of events to specify dependencies between commands. The lecturer also discusses the use of callback functions for event completion and highlights the importance of profiling for performance tuning. Additionally, the lecture covers the OpenCL concurrency model for multiple devices in a system, including the pipeline and parallel execution models. Finally, the lecturer demonstrates the implementation of an execution model using kernel events, which allows for parallel execution of different kernels.

The OpenCL concurrency model allows multiple work items to execute independently to improve performance, using work groups with local synchronization to achieve parallelism in execution, but too many work items can cause resource contention. Work items are responsible for maintaining their own program counters, and understanding the problem dimensions and problem sizes is important to design work items that take advantage of GPU processing elements. OpenCL uses work-group barriers for advanced synchronization among work items, but no mechanisms support synchronization between work items in different work groups of the same kernel execution. To synchronize work items within the same work group, the barrier API is used, but for synchronization on a global scale, events and wait events are used. The kernel function uses pointers to memory objects in the global and local memories, and local memory accessible to all processing elements can be used for data sharing within the work group. The lecture also covers native kernels, which allow using C functions as kernels on a device without relying on OpenCL compilers, passing OpenCL memory objects to a user function using the clEnqueueNativeKernel API, and built-in kernel functions, such as the motion estimation extension for OpenCL, used for image processing to estimate motion between neighboring frames in a video.

  • 00:00:00 In this section, we learn about the OpenCL runtime and concurrency model, including OpenCL commands, the queuing model, multiple command queues, OpenCL kernels, work items, and work groups. OpenCL is a task-parallel, host-controlled model, and kernels are used to perform tasks in parallel. The command queues are thread-safe, as multiple software threads may be running on a host and may or may not operate on the same command queue. Asynchronous operations are important in OpenCL: data movement and other operations are enqueued and executed at some point in the future. The command is the smallest unit through which the host interacts with the device, and the completion of a command is guaranteed only at a command queue synchronization point. This synchronization happens between commands in a host command queue and between commands in a device-side command queue.

  • 00:05:00 In this section, the lecture focuses on the blocking-read argument of the clEnqueueReadBuffer API and how it can be used to set up a synchronization point. By setting blocking read to true, the call becomes a blocking read operation that halts the host until the read is complete. The lecture also explains the use of events to specify dependencies between commands, which can be useful in complex operations involving multiple commands, data transfers, and computations. In addition, events can provide more information about a command than the command itself, as commands submitted with the clEnqueue APIs are handled asynchronously and cannot directly return error conditions or profiling data. However, by generating an event associated with a command, information about the command can be queried using the clGetEventInfo API.

  • 00:10:00 In this section, the OpenCL concurrency model is discussed, particularly the use of synchronization points to manage the execution of commands. One way to synchronize is to use wait events, where the host can block until a specific event happens, while the enqueue APIs accept wait lists of events for inter-command synchronization. The barrier operation, on the other hand, is a special command used in out-of-order queues, where it becomes a synchronization point. Markers serve a similar purpose but do not block execution, instead taking an implicit input event and producing an output event to notify the completion status of preceding commands. Other information about a command, such as its error conditions, can be queried using the clGetEventInfo API.

  • 00:15:00 In this section, the lecturer discusses OpenCL's command and event APIs, which are used to control the flow of data between host and device. He explains that the command API allows users to submit commands to the command queue, which can perform various operations such as kernel executions, buffer reads and writes, and memory mappings. By using the event API, users can query the execution status of the submitted command and apply profiling within the context. The lecturer also talks about the event callback function, which is triggered when the specific execution status of a command in a queue occurs. The callback function can be used to perform simple tasks and provide output data for the OpenCL application to leverage. The lecturer emphasizes that the callback function should be completed as quickly as possible and should not be a blocking function.

  • 00:20:00 In this section, the speaker discusses setting up callback functions to handle event completion. Setting the callback function right after declaring a completion event is not a valid location since there is no command associated with the event yet. A valid event is only created after a command is associated with it, such as when a kernel is added to the command queue. Callback functions should be thread-safe, quick to complete, and should not call expensive system routines. It's also important to enable profiling when it comes to performance tuning, as it helps to determine a command's execution status and time. The CL_QUEUE_PROFILING_ENABLE flag is used to enable profiling.

  • 00:25:00 In this section, the speaker discusses the OpenCL profiling API, which allows users to get information about events in the queue. By enabling profiling, a queue can provide information such as start and end times for commands, which can be used to calculate execution time. The speaker also explains user events, which are not associated with a specific command and can be used in an arbitrary way. By creating a user event and having a read command wait for it, the read operation will block until the user event becomes complete, allowing for more specific control over the ordering of commands.

  • 00:30:00 In this section, the speaker explains how to control the order of execution of commands and how to achieve better performance using an out-of-order queue. By default, the commands in a command queue are executed in order, but to gain flexibility and better performance, an out-of-order queue can be used. The speaker demonstrates a concrete example of setting up an out-of-order queue with multiple events and buffer objects. The example includes a write operation and a read operation with non-blocking, write-only and read-only buffers. User events and event status are also discussed. Overall, the section provides a comprehensive understanding of creating an out-of-order queue and optimizing the execution order of commands for better performance.

  • 00:35:00 In this section, the lecturer explains how to copy floating-point numbers from the host input buffer to the device's memory buffer using the OpenCL API. The code includes setting up the execution unit dimension, launching two kernel functions, and waiting for completion events. The lecture also emphasizes that the two kernel functions are separate commands and shows how to read output data back on the host side, waiting for events, and performing cleanup. Overall, the section demonstrates the use of four OpenCL commands to copy and process data between a host and a device.

  • 00:40:00 In this section, we see how command queues and events control the execution of OpenCL commands. By setting up dependencies using events, we can ensure that commands execute in a specific order. However, if we swap the order of commands in an out-of-order queue, it won't affect the execution of the commands. But, if we use an in-order queue and swap the commands, it can lead to deadlock.
    Therefore, it's crucial to properly set up dependencies and use the appropriate queue type to avoid potential issues. Additionally, we learn about multiple command queues and how they can be used to improve concurrency in OpenCL programs.

  • 00:45:00 In this section, the lecturer explains the OpenCL concurrency model for multiple devices in a system. Each device needs its own command queue, and it's also possible to have multiple command queues for a single device, though this is not commonly used. Generally, multiple devices using OpenCL follow one of two execution models: pipeline or parallel. In the pipeline model, even though two or more devices are present, one device waits for the result from another device, forming a pipeline. In the parallel model, devices work independently in parallel with their own buffers, executing independently. The lecture also includes an example of multiple command queues in a context on the OpenCL platform, two devices with their own command queues, and an implementation in OpenCL.

  • 00:50:00 In this section, we learn about the pipeline and parallel execution models in OpenCL. In the pipeline model, data is shared between two kernels, with kernel 0 running first and kernel 1 waiting for its completion before moving onto the next computation. We see the implementation of this model using events and API calls. In contrast, the parallel model is used when there are no dependencies between tasks, and kernels can run simultaneously on different devices. The example given has three devices, GPU devices 0 and 1, and a CPU device 2, with separate command queues for each. Kernel 0 and kernel 1 are launched separately to run in parallel, and once completed, CPU kernel 2 can start.

  • 00:55:00 In this section, the speaker discusses how to build an execution model in OpenCL by creating a kernel event, which allows for parallel execution of different kernels. The first two API calls are used to submit kernel 0 and kernel 1 to two separate GPU queues. Then, the third API call launches a CPU kernel, which waits for the completion of both GPU kernels before execution. The discussed approach uses events to generate dependencies between the CPU and GPU kernels but also allows the GPU kernels to execute in parallel. The speaker also explains the concepts of OpenCL work items, which define one copy or instance of computation, and processing elements, which are the smallest unit on a hardware device.

  • 01:00:00 In this section, the OpenCL concurrency model is explained, which relies on allowing multiple work items to execute independently to scale performance up. Work groups are a set of work items in a global execution space that are mapped to compute units, allowing for local synchronization as a way to achieve a certain degree of parallelism in execution. However, mapping too many work items to a compute unit can cause resource contention, which requires issuing just enough work to keep all processing elements busy all the time and finish the work groups in one batch.

  • 01:05:00 In this section of the video, the speaker discusses the OpenCL concurrency model and how work items can be grouped into hardware threads or contexts. Work items are responsible for maintaining their own program counters when mapping work items and dispatching them from the OpenCL framework. The speaker explains that work items should be designed to fulfill and take advantage of GPU processing elements. The dimensions of the problem size and the current element being worked on are important to understand. The speaker provides a set of built-in functions that can be used to understand the problem dimensions, global size, local size, and more importantly, to get the global and local ID as an index to find out what is the actual data item that needs to be worked on in the current work item. A simple kernel example is also provided to explain the different aspects of the OpenCL concurrency model.

  • 01:10:00 In this section, the lecturer explains a kernel function that calculates the address to use with the get_global_id function to get the ID number in the global space for computation. The kernel function then uses the address as an index into an array, multiplies the element by two, and assigns the product to a different buffer. The lecturer goes on to explain how the global ID in dimension zero is used to determine the element to compute on and how the global ID in dimension one helps to determine how many rows have been gone through so far. The idea is to ensure that each work item computes on a unique element and that there is no repetition of computations on any element, thus saving GPU or CPU cycles.
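
The indexing scheme of this kernel can be mimicked on the CPU by looping over the two global dimensions (a sketch for illustration; on the device, one work item runs per (gx, gy) pair instead of a loop):

```python
def double_kernel(src, width, height):
    """CPU sketch of a kernel where each work item doubles one element.
    gx plays the role of get_global_id(0) (the column), gy the role of
    get_global_id(1) (the row); gy * width + gx is a unique linear
    address, so no element is ever computed twice."""
    dst = [0] * (width * height)
    for gy in range(height):           # get_global_id(1)
        for gx in range(width):        # get_global_id(0)
            addr = gy * width + gx     # unique address per work item
            dst[addr] = src[addr] * 2
    return dst

print(double_kernel([1, 2, 3, 4, 5, 6], 3, 2))  # [2, 4, 6, 8, 10, 12]
```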

  • 01:15:00 In this section, the video discusses how OpenCL synchronization works and why it can be difficult to guarantee hardware-based ordering. OpenCL is a cross-platform API framework that doesn't know where the program will be executed, so it cannot guarantee hardware-based ordering. OpenCL runs on devices that may support multi-threading, but there is no operating system running separately on such devices. This means there's no master or OS kernel to remove threads from an execution queue. As a result, there's a risk of deadlock when using semaphores or similar mechanisms. To perform advanced synchronization among work items, OpenCL uses a work-group barrier. However, OpenCL does not have mechanisms to support synchronization between work items in different work groups of the same kernel execution. To support data sharing within the work group on the same compute unit, local memory accessible to all processing elements can be used.

  • 01:20:00 In this section, we learn about using the barrier operation to synchronize work items within the same work group in OpenCL. The barrier API is used as a synchronization point, and work items are not allowed to continue past the barrier until all other work items in the work group have also reached the barrier. This ensures that all work items in the group have the same data before proceeding. Using the global synchronization line in OpenCL, one can use event and wait event to synchronize at a global scale, but within the work group at the work items level, one needs to leverage the barrier operation to perform synchronization.

  • 01:25:00 In this section, we learn about the kernel function in OpenCL and its arguments, which are pointers to memory objects in the global and local memories. The local memory is present within a compute unit and is shared by the processing elements in that compute unit. The kernel function uses the global ID and local ID to read data from the global memory and assign it to the corresponding element in a local buffer. To synchronize all the work items working on the same assignment statement, the work-group barrier with a local memory fence is used as a synchronization point. After synchronization, the kernel function performs operations on the data in the local buffer and stores the result in the corresponding location in the output buffer in the global memory.
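
The copy-to-local, barrier, then compute pattern can be simulated sequentially by splitting the work group into phases (a sketch only; on a device, the work items run concurrently and the barrier is a real synchronization point, not a phase boundary in a loop):

```python
def work_group_sum(global_in, group_id, group_size):
    """Simulate one work group: phase 1 copies from 'global' memory to a
    local buffer; the phase boundary plays the role of the barrier;
    phase 2 computes using the fully populated local buffer."""
    local_buf = [0] * group_size
    base = group_id * group_size
    # Phase 1: each work item (lid = local ID) loads one element.
    for lid in range(group_size):
        local_buf[lid] = global_in[base + lid]
    # --- barrier with local memory fence: all loads complete here ---
    # Phase 2: any work item can now safely read its neighbors' data.
    return sum(local_buf)

# Work group 1 of size 4 sums elements 4..7 of the global buffer.
print(work_group_sum([1, 2, 3, 4, 5, 6, 7, 8], 1, 4))  # 26
```

Without the barrier, a work item in phase 2 could read a local-buffer slot that another work item had not yet written, which is exactly the hazard the barrier prevents.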

  • 01:30:00 In this section, the lecture introduces the concept of native kernels in OpenCL, which allows using C functions as kernels on a device without relying on OpenCL compilers. The API for invoking native kernels is presented along with the mechanism for passing arguments to the C function using a scheme called "unboxing". The lecture provides an example of passing arguments, which involves passing constant numbers and pointers to memory objects, including OpenCL buffers, that are retrieved from the OpenCL environment.

  • 01:35:00 In this section, the lecturer explains how to pass OpenCL memory objects to a user function using the clEnqueueNativeKernel API. The API requires the number of memory objects being passed and their respective locations to be specified in a list. The argument block contains placeholders for the memory objects alongside constant numbers; the placeholders are distinguished from plain constants because the runtime must substitute the actual memory pointers before the function runs. Built-in kernels are device-specific and leverage special hardware resources, but they are not universal and may not have the same function across different devices.

  • 01:40:00 In this section, we learn about built-in kernel functions, such as the motion estimation extension for OpenCL, which is used in image processing to estimate the motion between neighboring frames in a video. This extension provides hardware acceleration or embedded firmware functionality and can be explored further on the OpenCL website.
Lecture 5 OpenCL Concurrency Model
  • 2018.10.13
  • www.youtube.com
OpenCL Runtime and Concurrency Model (please refer to Chapter 5 of textbook)
 

21. Map Reduce Concept



The concept of MapReduce is explained in this video, which involves breaking down large problems into smaller subproblems using a mapping phase followed by a reduction phase. This approach is used by Google to process vast amounts of data on their computers in data centers. The video provides an example of how processors operate independently, assigning processors some data to work on, which produces key-value pairs upon completion. The key-value pairs are then processed by a group of different processors to get the final result in the reduction phase. This approach allows for efficient processing of large datasets by distributing the workload across multiple machines.

  • 00:00:00 In this section, the speaker discusses the concept of MapReduce in the context of decomposing large problems to obtain smaller ones that can be addressed efficiently. They identify two main strategies for decomposition, namely divide and conquer and scatter and gather, which are used interchangeably depending on the properties of the problem or hardware limitations. They explain that the MapReduce approach, which was first proposed by Google in 2004, partitions large problems into smaller subproblems and processes them individually using a mapping phase followed by a reduction phase. They illustrate how the concept is used by Google to process vast amounts of data on their computers in data centers, which serve search engine queries and web pages.

  • 00:05:00 In this section, the MapReduce approach is explained through two stages: mapping and reduction. The first stage involves assigning processors some data to work on, each producing key-value pairs upon completion. The second stage, known as reduction, involves processors receiving the key-value pairs from the first stage and performing a reduction on the values corresponding to a given key. This approach is efficient because processors can operate independently once the original problem is partitioned into subproblems, making it scalable to problems of different sizes. The video gives a concrete example using an input with six blocks assigned to processors, illustrating the potential efficiency of this approach.

  • 00:10:00 In this section, the mapping stage is illustrated: the original input is partitioned into six pieces of roughly equal size to keep the load balanced, and each piece is assigned to a different processor, so the key-value pairs are produced at almost the same time during the mapping process. In the next stage, these key-value pairs are processed by a different group of processors to obtain the final result: the reduction stage combines the key-value pairs, with each processor building the list of values associated with a single key.

  • 00:15:00 In this section, the speaker explains the mapping process in MapReduce, which involves breaking down tasks into smaller subtasks that can be executed in parallel. This is done by creating a list for each keyword that needs to be analyzed, with each item in the list representing a value associated with that keyword. This allows large datasets to be processed efficiently by distributing the workload across multiple machines.
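The two phases described above can be sketched in plain Python. This is an illustrative model only (the function names and the parity example are not from the video): a mapping function emits key-value pairs, the pairs are grouped by key, and a reduction function combines each group's values.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Mapping phase: each record could be handled by an independent processor.
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    # Shuffle step: group values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduction phase: each key's group could go to a different processor.
    return {key: reduce_fn(values) for key, values in groups.items()}

# Example: sum the numbers 1..6 per parity key.
result = map_reduce(
    range(1, 7),
    map_fn=lambda n: [("even" if n % 2 == 0 else "odd", n)],
    reduce_fn=sum,
)
```

Because each record maps independently and each key reduces independently, both loops could be distributed across processors exactly as the video describes.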
Map Reduce Concept
  • 2020.06.14
  • www.youtube.com
This video introduces the basic concept of MapReduce.
 

22. Map Reduce Example: WordCount and Weblink




This YouTube video demonstrates how MapReduce can be applied to count the occurrences of each word in a large text file and analyze web page URL relationships. MapReduce allows for each processor to target specific keywords independently in the mapping stage, which involves splitting up the document into smaller sections. The reduction stage involves grouping key-value pairs based on the word key and summing up the values to get the total number of appearances for each word. For web page analysis, the mapping process involves creating key-value pairs with the URL as the key and a list of linked web pages as the values, and the reduction stage builds the final map to show the relationship between web pages.

  • 00:00:00 In this section, we see a concrete example of MapReduce in action: counting the occurrences of each word in a large text file containing hundreds of thousands of English words. This is an ideal use case for MapReduce because different keywords can be counted independently, with each processor targeting specific keywords. The mapping stage involves splitting the document into smaller pieces, with each processor receiving a similarly sized section and emitting key-value pairs for the words of interest. The reduction stage involves grouping all the key-value pairs by the word key and summing the values to get the total number of appearances of each word. The example also demonstrates the versatility of MapReduce: it can be used to count web page accesses by using the URL as the keyword, or to analyze the link relationships between different websites.
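The word-count stages above can be sketched in plain Python (the chunking scheme and names are illustrative, not the course's actual code): the document is split into similarly sized chunks, each chunk maps to (word, 1) pairs, and the reduction sums the counts per word key.

```python
from collections import Counter

def map_chunk(chunk):
    # Mapping stage: emit a (word, 1) key-value pair for every word.
    return [(word, 1) for word in chunk.split()]

def word_count(document, num_chunks=3):
    words = document.split()
    step = max(1, len(words) // num_chunks)
    # Split the document into similarly sized chunks, one per "processor".
    chunks = [" ".join(words[i:i + step]) for i in range(0, len(words), step)]
    counts = Counter()                       # reduction: sum values per word key
    for chunk in chunks:                     # each chunk maps independently
        for word, one in map_chunk(chunk):
            counts[word] += one
    return dict(counts)

counts = word_count("the quick fox jumps over the lazy dog the fox")
```

Each chunk's mapping is independent, so in a real deployment the inner loop would run on separate machines and only the (word, count) pairs would be shuffled to the reducers.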

  • 00:05:00 In this section, the speaker explains how MapReduce can be used to analyze a large number of web page URLs and their links in order to build a graph showing the relationships between them. The mapping process involves grouping the tuples into smaller chunks and assigning them to mapping processors, which build key-value pairs with the URL as the key and a list of the web pages linked from that URL as the values. In the reduction stage, multiple keys can be handled by different processors, and the key-value pairs are used to build the final map showing the relationships between the web pages.
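The web-link mapping and reduction described above can be sketched in plain Python (the URLs and function names are illustrative): each (source, link) tuple maps to a pair keyed by the source URL, and the reduction collects the linked pages for each URL into a list.

```python
from collections import defaultdict

def build_link_map(tuples):
    # Mapping stage: one key-value pair per link, keyed by the source URL.
    pairs = [(src, dst) for src, dst in tuples]
    # Reduction stage: group the linked pages by source URL.
    link_map = defaultdict(list)
    for src, dst in pairs:
        link_map[src].append(dst)
    return dict(link_map)

links = build_link_map([
    ("a.com", "b.com"), ("a.com", "c.com"), ("b.com", "c.com"),
])
```

The resulting dictionary is the "final map" the video mentions: each key is a URL and each value lists the pages it links to.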
Map Reduce Example: WordCount and Weblink
  • 2020.06.14
  • www.youtube.com
This video introduces the MapReduce concept with two examples: word count and web link relationship.
 

23. Considerations of MapReduce on OpenCL device




The discussion in this YouTube video centers on the use of MapReduce on OpenCL devices, with a focus on memory structure, work organization, and local/global reduction. The speaker notes the advantage of leveraging numerous processing elements on OpenCL devices and emphasizes the use of different memory hierarchies while using MapReduce to process large datasets efficiently. They also detail the five steps involved in the use of MapReduce on OpenCL devices, covering the mapping process, local reduction, synchronization of work items, global barriers, and final result production.

  • 00:00:00 In this section, the focus is on how MapReduce on OpenCL devices differs from traditional CPU environments. The main difference lies in the memory structure, which is hierarchical and consists of private, local, and global memory, each with different sizes and data visibility. Private memory is only visible to the work item itself, while local memory is visible to all work items in the compute unit, but not across units. Global memory, on the other hand, is visible to all work items. Leveraging the large number of processing elements on OpenCL devices presents an opportunity, but the memory structure must be taken into account.

  • 00:05:00 In this section, the speaker discusses using MapReduce to perform operations on OpenCL devices, such as the matrix example discussed, where different work items are used and a computation is assigned to each element of the matrix being worked on. Using the memory hierarchy is important, including private and local memory, and the speaker notes the benefit of performing some of the reductions in local memory instead of global memory. They detail how work groups are organized and how the mapping operation is performed on each data chunk assigned to a work item. Used this way, MapReduce can help process large datasets efficiently on OpenCL devices.

  • 00:10:00 In this section, the speaker discusses the concept of local and global reduction in MapReduce on an OpenCL device. The local reduction is done on key-value pairs produced by each processing element, while the global reduction combines the results of multiple work groups in global memory. The challenge in implementing MapReduce on OpenCL devices is that the number of key-value pairs produced by the mapping stage is unknown, and there is a need to allocate memory spaces for intermediate results properly. Additionally, local interaction results must be stored in local memory, and different memory hierarchies must be considered.

  • 00:15:00 In this section, the speaker discusses the five steps for MapReduce on OpenCL devices. The first step involves the work item group performing the mapping process and a portion of the local reduction stage. The second step includes executing a local barrier to synchronize work items before performing reductions. The third step is having the local ID zero work item in each group pre-process the key-value pairs produced by other work items. The fourth step involves executing a global barrier to prevent further execution until all workers finish. Finally, the fifth step involves combining the results from each workgroup to produce the final result, ensuring that all work items have finished before doing so.
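The five-step scheme above can be simulated in plain Python, assuming a simple sum reduction (this models the structure of the work-group mechanics rather than reproducing actual OpenCL code): each group reduces its own items in "local memory" first, and only one representative per group updates the "global" result.

```python
def two_level_reduce(values, group_size):
    global_result = 0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]   # one work group's items
        local_mem = list(group)                    # partial results in local memory
        # (a local barrier would synchronize the group's work items here)
        group_total = sum(local_mem)               # work item 0 reduces locally
        # (a global barrier would separate the stages here)
        global_result += group_total               # one global update per group
    return global_result

total = two_level_reduce(list(range(16)), group_size=4)
```

The payoff of this structure on a real device is that the many per-item updates hit fast local memory, and global memory sees only one combined update per work group.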
Considerations of MapReduce on OpenCL device
  • 2020.06.14
  • www.youtube.com
This video introduces specifics of implementing MapReduce on OpenCL devices.
 

24. MapReduce Example: String Search with Demo




The video demonstrates various aspects of OpenCL programming and MapReduce, with a focus on implementing string search. The speaker explains how to declare and allocate memory using the local qualifier and points out that dynamic memory allocation is not allowed in the kernel function. They also introduce vector data types and demonstrate how they can simplify element-wise addition and memory access. The main focus is on implementing string search using MapReduce, where the input text is divided into work items and assigned to a map function to search for a keyword. Each work item carries out this process while comparing the chunks of text with a pattern vector. Local results are obtained by atomic increment to prevent collisions, and the final result is obtained by aggregating the results from each work item. The speaker also provides a detailed explanation of the kernel function, including arguments required and how it is initialized.

  • 00:00:00 In this section, the speaker discusses local memory allocation in OpenCL and demonstrates it in the context of a string-search kernel function. Local memory is declared using the local qualifier, either as an array inside the kernel or as a pointer argument to the kernel. The speaker notes that dynamic memory allocation is not allowed inside the kernel function, so the set-kernel-argument API (clSetKernelArg, passing a size with a NULL pointer) is used in the host program to allocate the local memory space. Additionally, the speaker introduces vector data types, which allow faster and easier operations on groups of elements of the same type.

  • 00:05:00 In this section, the speaker explains how to use the vector data type in OpenCL kernel function for element-wise addition. Using vectors simplifies the addition and assignment process, and the operation is applied to every single element in the vector. Initialization of vectors can also be done explicitly with specific values or by repeating a single value for all elements in the vector. The speaker also discusses how the compact memory access benefits from the vector type, especially when working with an array of characters, where each element is a letter or a byte.

  • 00:10:00 In this section, the speaker explains how to use MapReduce to implement string search, with a demo. The search is carried out on an input text using a character pattern of 16 characters. They show how individual components of a vector can be accessed, for example taking the first eight letters from the message vector, and how components can be accessed in a different order. They elaborate on how values are assigned to portions of a character vector, give examples, and explain how to build the kernel function that searches the input text, including the arguments required by that function and how it is initialized. Finally, they explain how the global results are obtained and what those results signify.

  • 00:15:00 In this section, the speaker explains the initialization process and the synchronization point (a local memory fence) in the string-search MapReduce example. The characters of the original document are divided into groups, with each group assigned a specific set of letters in which to search for the corresponding keyword. The global ID and the number of characters per item are used to identify the starting point of each group's assigned work. Overall, this section provides insight into the technical aspects of MapReduce and its application to string searching.

  • 00:20:00 In this section, the speaker describes the implementation of MapReduce to perform a string search using a demo. The input data is divided into work items, and each work item is assigned to a map function to search for a keyword in a chunk of text. The speaker demonstrates how the chunks of text are loaded into the memory and the window of characters is moved to search for the keyword. A vector loaded with 16 characters is used for comparison, and the result is stored in another vector for the chunk. This process is repeated for each work item, and the final result is obtained by aggregating the results from each work item.

  • 00:25:00 In this section, the video explains how each text window is checked for a keyword by comparing the text vector with the pattern vector corresponding to the keyword entered by the user. The comparison operation compares every element of the two vectors, generating a 1 or 0 result for each element. The most significant bits of the elements of the resulting check vector determine whether there is a keyword match: if the condition holds true, meaning all the significant bits are 1, a keyword has been found and the local result is incremented, indicating that the program has found a match.

  • 00:30:00 In this section, the speaker explains how the string-search MapReduce example works. The code consists of several work groups, where each group checks for specific keywords within the text that has been partitioned into blocks. The goal is to find whether any of the keywords exist in each block. When a match is found, the counter is incremented using an atomic increment to prevent collisions. After all the keywords are checked in a block, a barrier is employed to ensure that every operation that accesses global memory has completed. The loop then shifts the window to the right to check the next set of 16 letters, and the process is repeated for each work group. Finally, the local result counter is added to the global result using an atomic add, again to prevent collisions.
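The search loop described above can be sketched in plain Python, using a short pattern and simple integers in place of the kernel's 16-character vectors and atomic operations (all names and sizes here are illustrative): slide a window over each chunk, compare it with the pattern element by element, and accumulate per-chunk counts into one global counter.

```python
def count_matches(text, pattern, chunk_size):
    width = len(pattern)
    global_count = 0
    # Overlap chunks by width-1 so matches spanning a boundary are not lost.
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size + width - 1]
        local_count = 0                                # per-work-group counter
        for i in range(len(chunk) - width + 1):        # shift the window right
            window = chunk[i:i + width]
            # Element-wise compare: every position must match (all bits set).
            if all(w == p for w, p in zip(window, pattern)):
                local_count += 1                       # "atomic increment"
        global_count += local_count                    # "atomic add" to global
    return global_count

n = count_matches("that with that thataway", "that", chunk_size=8)
```

The chunk overlap plays the role of the boundary checking mentioned in the video: without it, a keyword straddling two work items' regions would be missed.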

  • 00:35:00 In this section, the speaker explains the implementation of the string-search kernel function: local memory is initialized, counters are reset, and the program uses the group ID to calculate the offset and load 16 bytes at a time into the text vector. The function then compares the loaded text against the pattern and performs the global reduction. The example code on GitHub is shown, and the speaker outlines the kernel arguments, such as the pattern, the text buffer, the number of characters per work item, and the total length of the original text. Finally, he talks about how the global size depends on the available resources and how the overall configuration affects the process of finding the desired string.

  • 00:40:00 In this section, the lecturer explains the example of using MapReduce to search for specific words in a given text file. The goal is to count the occurrences of four specific words in the text file. The lecturer suggests reviewing the source code and running it for a better understanding of the process.
MapReduce Example: String Search with Demo
  • 2020.06.14
  • www.youtube.com
This video explains the implementation of string search on an OpenCL device using the MapReduce approach. Vector instructions and operations are briefly explained. ...
 

25. OpenCL Example: Radix Sort




In this video, the concept of radix sort is introduced, which divides a larger sorting problem into smaller subsets based on the numerical representation of the elements rather than their actual values. The speaker demonstrates with an example of sorting eight numbers by their least significant digit in hexadecimal representation. The OpenCL shuffle and shuffle2 functions are used to efficiently rearrange elements during the sorting process. The video also explains how to perform a shuffle operation in OpenCL and how to use shuffle instructions in the kernel function for radix sorting. Furthermore, the video explores the eight-element radix sort kernel function, which sorts arrays in OpenCL by splitting the input vector into a zeros bucket and a ones bucket based on the values of its binary digits.

  • 00:00:00 In this section, the video introduces the concept of radix sort, which divides a large sorting problem into smaller ones that are sorted as subsets. The sorting is performed based on the numerical representation of the elements rather than the values of the elements themselves. The speaker demonstrates with an example of sorting 8 numbers, where the sort is based on the least significant digit in the hexadecimal representation. The sorting proceeds through several iterations, in which the numbers are placed into buckets according to their least significant digits and the process is repeated for further digits until the numbers are sorted in ascending order.
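The digit-by-digit process can be sketched in plain Python for two-hex-digit numbers (a simplified model of the idea, not the video's OpenCL kernel): bucket the numbers by their least significant hexadecimal digit, concatenate the buckets in order, and repeat for the next digit.

```python
def radix_sort_hex(numbers, digits=2):
    for d in range(digits):                    # least significant digit first
        buckets = [[] for _ in range(16)]      # one bucket per hex digit 0..F
        for n in numbers:
            buckets[(n >> (4 * d)) & 0xF].append(n)
        # Concatenating the buckets in order completes one pass.
        numbers = [n for bucket in buckets for n in bucket]
    return numbers

sorted_vals = radix_sort_hex([0x3A, 0x12, 0xFF, 0x07, 0xA1, 0x2B, 0x90, 0x4C])
```

Because each pass keeps the order of equal digits (the sort is stable), finishing with the most significant digit leaves the whole list in ascending order.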

  • 00:05:00 In this section of the video, the speaker demonstrates how to apply radix sort to a set of numbers by first organizing them into buckets based on their least significant digit. They then repeat the process using the next digit and continue until they have examined all the digits of each number. The speaker notes that each step of radix sort is essentially a shuffle, and using OpenCL's shuffle functions allows for the efficient rearrangement of elements during the sorting process. Lastly, the speaker shows how a mask can be used to specify the indices of the elements to be selected and demonstrates this with two vectors and a mask array.

  • 00:10:00 In this section, the use of the shuffle function and the shuffle2 function in OpenCL is explained. The shuffle function creates an output vector by selecting or rearranging the values of the input vector according to the given mask. The shuffle2 function is similar but takes two input vectors instead of one. The size and data type of the mask must match the returned vector, and the mask's data type must be an unsigned integer type. Additionally, the returned vector has the same data type and number of components as the input vector, and only a select number of bits in the mask vector's components are significant. The value of K in shuffle2 depends on the number of components in the input vector.

  • 00:15:00 In this section, the speaker explains how to perform a shuffle operation using OpenCL. They provide an example of a shuffle operation that takes an input vector and a mask, and outputs a vector based on the values selected from the input vector by the mask. They also explain how to use the mask values to determine which values to select from the input vector to build the output vector. The speaker provides a second example that involves two input vectors containing letters, and explains how to use the mask to build the output vector based on the selected values from the input vectors.
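The index-selection semantics of shuffle and shuffle2 can be modeled in plain Python (a sketch of the semantics only; in real OpenCL the mask is an unsigned integer vector and only its low index bits are significant): each output component is the input component picked by the corresponding mask value, and shuffle2 indexes into the concatenation of two input vectors.

```python
def shuffle(x, mask):
    # Each mask value selects one component of x; only the low bits matter.
    return [x[m % len(x)] for m in mask]

def shuffle2(x, y, mask):
    # Indices past the end of x select from y, as if x and y were concatenated.
    both = x + y
    return [both[m % len(both)] for m in mask]

out = shuffle2(list("abcd"), list("efgh"), [6, 0, 5, 3])
```

With the two letter vectors above, mask value 6 picks the seventh element of the combined "abcdefgh", mask value 0 picks the first, and so on.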

  • 00:20:00 In this section, the video discusses using shuffle instructions in the kernel function of the OpenCL radix sort example. A kernel function named "shuffle test" is shown, which takes two arguments: a floating-point vector and a 16-element character vector. The video shows an example of using an integer vector as a mask to build an output vector containing eight floating-point numbers from a four-element floating-point input vector. It then explains how sorting works in the kernel function by looking at the least significant binary digit and using it to pick out the odd numbers to create a new vector.

  • 00:25:00 In this section of the video, the speaker explains how to build a mask based on the least significant digit of the elements in a vector in order to sort them with the radix sort algorithm. By copying the elements to designated vectors and building a mask from the least significant digit of each element, the shuffle2 function can be used to retrieve values from the original and destination vectors and build a new vector. This process sorts the numbers into two buckets, 0 and 1, based on that binary digit.

  • 00:30:00 In this section, we learn about the eight-element radix sort kernel function, which works by splitting the input vector into a zeros bucket and a ones bucket based on the values of its binary digits. The kernel function defines a union structure used as a buffer for sorting the data array, along with two counters, zero count and one count. A comparison value is used to select which binary digit the values are compared on, and the function uses two arrays, mask ones and data, corresponding to the diagram in the video, to store the sorted elements. The for loop has eight iterations to work on the eight numbers to be sorted, and J runs from zero to eight to determine how many elements go into the zeros and ones buckets. This kernel is an effective way to sort small arrays in OpenCL.

  • 00:35:00 In this section, the speaker explains how the radix sort works, which involves sorting values in a data set according to their corresponding digit at each position. The process starts with the least significant digit and moves towards the most significant digit. The speaker uses an example to show how the values ending with ones and zeroes are separated into different arrays before performing a shuffle operation to create the final output vector. The process is repeated for the other digits, and the final result is stored in a global buffer.
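One-bit bucket passes like the ones described can be sketched in plain Python (a model of the idea only; the actual kernel uses vectors, masks, and shuffle2): for each bit position, from least to most significant, split the values into a zeros bucket and a ones bucket and concatenate them.

```python
def sort8_by_bits(values, num_bits=8):
    for bit in range(num_bits):                # least significant digit first
        cmp = 1 << bit                         # selects which binary digit to test
        zeros = [v for v in values if not (v & cmp)]
        ones = [v for v in values if v & cmp]
        values = zeros + ones                  # shuffle step: zeros, then ones
    return values

out = sort8_by_bits([170, 45, 75, 235, 100, 8, 195, 175])
```

Each pass preserves the relative order within a bucket, which is exactly why repeating the split for every bit position, ending at the most significant one, yields a fully sorted result.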
OpenCL Example: Radix Sort
  • 2020.03.26
  • www.youtube.com
OpenCL Example: Radix Sort using Shuffle Function