OpenCL in trading - page 8

 

36. Execute Instructions on CPU Datapath

The video explains how computations are executed on a CPU datapath, using an accumulation operation as the example. The datapath includes load and store units that move data to and from memory using addresses, and functional units such as ALUs that perform operations. The video walks through the process step by step: loading data from memory, performing operations, and storing results back into memory. The speaker also explains how an FPGA can implement the same function while making the most of the resources available in the hardware.

  • 00:00:00 In this section, the video explains how computations are mapped onto an FPGA, using an accumulation operation as the example. First, high-level code is converted into assembly language using CPU instructions, and intermediate values are stored in registers. The CPU is a pipelined CPU with functional units in its datapath, including load and store units that move data to and from memory using addresses. The datapath is designed to be general enough to execute all kinds of instructions within a fixed data width and set of operations, and a constant value can be loaded into a register via the ALU. The video also walks step by step through an example of six instructions executing in the CPU.

  • 00:05:00 In this section, the speaker steps through several instructions and explains how they are executed on a CPU datapath: loading data from memory into register files, using different functional units in the datapath to perform operations such as multiplication and addition, and storing results back into memory. The speaker then explains how an FPGA can implement the same kernel function by unrolling the CPU hardware and using exactly the resources the function requires, making the most of the resources available in the FPGA hardware.
Execute Instructions on CPU Datapath
  • 2020.07.04
  • www.youtube.com
This video reviews how instructions are executed on a traditional CPU data path, which will be contrasted with the mapping to a customized FPGA design. Ackno...
 

37. Customized Datapath on FPGA

The video explains how an FPGA can implement the kernel function with improved performance by unrolling the CPU hardware and customizing the datapath on the FPGA. By removing unused units, replacing loaded constants with fixed wiring, and rescheduling some operations, load operations can be performed simultaneously to increase performance. A customized datapath can improve throughput and reduce latency and power consumption by selecting only the operations and data needed for a particular function. The video shows an example of element-wise addition on two vectors, with the result stored back in memory; registers between stages allow an efficient pipeline that launches eight work items for back-to-back additions.

  • 00:00:00 In this section, the concept of using an FPGA to implement the kernel function with improved performance and resource utilization is explained. The idea is to unroll the CPU hardware and use FPGA resources to create a design that implements the required function, rather than carrying hardware that is not used in every step of execution. By removing unused units, replacing loaded constants with fixed wiring, and rescheduling some operations, the load operations can be performed simultaneously, increasing performance. Customizing the datapath on the FPGA achieves the same result using specialized, dedicated resources.

  • 00:05:00 In this section, the speaker discusses the design of a customized datapath on an FPGA: selecting the necessary operations, data, memory size, and configuration for a particular function in a way that improves throughput and reduces latency and power consumption. The example is a kernel function that performs element-wise addition on two vectors, with the result stored back in memory. Leveraging registers between stages, the datapath forms an efficient pipeline and launches eight work items for back-to-back additions, allowing each cycle to process a different thread and avoid idle units.
Customized Datapath on FPGA
  • 2020.07.04
  • www.youtube.com
This video explains how to map an OpenCL program onto a customized design in FPGA. Acknowledgement: the slides are from Intel's "OpenCL for FPGA" tutorial at ISC...
 

38. OpenCL for FPGA and Data Parallel Kernel

The video explains how OpenCL lets FPGA engineers tap software engineering resources to expand the number of FPGA application developers, taking advantage of the parallel computing resources on FPGAs. OpenCL's programming model expresses parallelism through data parallel functions called kernels, and each kernel instance relies on the identifier returned by get_global_id() to perform parallel computations on independent data segments. The concept of threads and work groups is introduced: threads access different parts of the data set and are partitioned into work groups, and only threads within the same work group can share local memory. With this programming model, OpenCL allows for efficient data parallel processing.

  • 00:00:00 In this section, the speaker introduces OpenCL and its significance in designing FPGA-based applications. Although there are fewer programmers for FPGAs than for standard CPUs, OpenCL, as a high-level programming language, expands the number of FPGA application developers by allowing FPGA engineers to use software engineering resources to take advantage of the parallel computing resources on FPGAs. OpenCL is an industry standard for heterogeneous computing and allows programmers to use familiar C or C++ APIs to write programs that execute complex workloads on hardware accelerators such as multi-core processors, GPUs, and FPGAs. The big idea behind OpenCL is its execution model, which allows parallelism to be specified explicitly.

  • 00:05:00 In this section, the programming model of OpenCL for FPGA and data parallel kernels is explained. The video describes the structure of an OpenCL framework with a host and an accelerator (device) running in separate hardware domains. The host prepares the devices and kernels and creates the commands to be submitted to those devices. The accelerator code is written in OpenCL C, and the host communicates with it through a set of OpenCL API calls, providing an abstraction of the communication between a host processor and kernels executed on the device. OpenCL kernels are data parallel functions used to define multiple parallel threads of execution, each relying on the identifier returned by get_global_id(). These IDs specify the segments or partitions of data that a kernel instance is supposed to work on, allowing parallel computations to be performed on independent pieces of data.

  • 00:10:00 In this section, the concept of threads and work groups is introduced: threads access different parts of the data set and are partitioned into work groups. Only threads within the same work group can share local memory. Each thread has a local and a global ID, and the global ID can be calculated from the group ID, local size, and local ID (global ID = group ID × local size + local ID). This system allows for efficient data parallel processing, as in the kernel sketch below.
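
To make this concrete, here is a minimal data-parallel kernel sketch in OpenCL C; the vector-add operation and all names are illustrative, not taken from the video.

```c
// One work item per element: each instance reads its own position
// from get_global_id and works on an independent pair of data.
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    size_t gid = get_global_id(0);  // this work item's global ID along dimension 0
    c[gid] = a[gid] + b[gid];       // independent per-element computation
}
```
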
OpenCL for FPGA and Data Parallel Kernel
  • 2020.04.19
  • www.youtube.com
A recap of OpenCL for FPGA, how kernels identify data partition
 

39. OpenCL Host Side Programming: Context, queues, memory objects, etc.

This video tutorial explores host-side programming concepts in OpenCL, focusing on contexts, queues, and memory objects. It covers two additional APIs, clCreateKernelsInProgram and clSetKernelArg, which are used to create kernel objects and pass arguments to kernel functions. The tutorial also discusses the clCreateImage API for creating image objects, and how image pixels are stored in memory according to channel order and channel type. It explains how OpenCL handles 2D and 3D images, how developers can query memory objects with APIs such as clGetMemObjectInfo, and how to perform memory object operations such as rectangular buffer reads and writes, mapping memory objects, and copying data between memory objects.

  • 00:00:00 In this section, the host-side programming concepts of OpenCL are revisited, focusing on contexts, queues, and memory objects. Multiple contexts can be built on a physical machine even when it contains devices from different vendors. Memory objects in global memory can be shared by multiple queues, but the application must perform appropriate synchronization on the host side. It is possible to have multiple contexts, and multiple command queues within one context. The OpenCL platforms provided by different vendors are not necessarily compatible, however, and hence their devices cannot be placed in the same context.

  • 00:05:00 In this section, the video discusses two new APIs in OpenCL. The first, clCreateKernelsInProgram, creates a kernel for every function in an OpenCL program, producing an array of kernel objects whose function names and other properties can be verified with clGetKernelInfo. The second, clSetKernelArg, sets a kernel argument and takes the kernel object and the index of the argument as parameters. The video goes on to explain how to use these APIs and how to release kernel objects after use (a hedged host-side sketch of this flow appears after this list).

  • 00:10:00 In this section, we learn how the API passes argument values to kernel functions. We can pass a primitive data type through a pointer to its value, or pass a pointer to a memory object or a sampler object containing complex data. Image objects are a special type of memory object used to hold pixel data. We can create image objects using the same configuration flags as buffer objects, with formats chosen from a list of supported image formats. The clCreateImage API creates image objects, and its parameters are similar to those used for creating buffer objects: the third argument identifies the format properties of the image data to be allocated, while the fourth argument describes the type and dimensions of the image.

  • 00:15:00 In this section, the image format argument of the clCreateImage() API is introduced to specify how image pixels are stored in memory. The format consists of two factors: channel order and channel type. The channel order identifies how channel information is stored for each pixel; it is an enumerated type covering the basic colors and alpha information. The channel type specifies how image channels are encoded in binary, using different values to determine the representation of the color information; the bit depth determines how many bits represent the color value in each channel. The memory layout of image formats is also demonstrated: for each pixel, the RGBA sequence is stored in memory, with one byte encoding the color information for each channel.

  • 00:20:00 In this section, the video discusses how OpenCL handles 2D and 3D images, which can consist of multiple slices stacked together in another dimension. The image descriptor (cl_image_desc) describes how the image object is laid out, with parameters such as the image width, height, and depth in pixels, as well as the scanline pitch in bytes. The descriptor passed to the clCreateImage() API thus identifies the number of bytes needed to describe the image, which may require adjustments for padding and alignment within rows and slices (see the image-creation sketch after this list).

  • 00:25:00 In this section, the speaker explains how to gather information about image and memory objects in OpenCL using APIs such as clGetImageInfo and clGetMemObjectInfo. These APIs let developers obtain information such as the image format, pixel size, width, height, depth, and other properties of memory objects. Additionally, clEnqueueReadBuffer/clEnqueueWriteBuffer read or write data in buffer objects, and clEnqueueReadImage/clEnqueueWriteImage access image object data. The origin, region, row pitch, and slice pitch parameters are specific to image objects, which are organized in terms of rows, slices, and pictures. Developers can use these APIs to specify the exact location of a region they want to access or copy, and to generate events through the cl_event arguments.

  • 00:30:00 In this section, the video explains two memory object operations in OpenCL: rectangular buffer reads and writes, and mapping memory objects. With clEnqueueReadBufferRect and clEnqueueWriteBufferRect, the user specifies origin and size information, allowing data to be retrieved or written at specific points. Mapping associates a memory object on a device with a memory region on the host; once mapped, the memory object can be read and modified on the host side through pointers obtained from the memory mapping APIs. The video also goes through the memory object operations available in OpenCL for copying data between memory objects, which simplify host-side programming and improve read and write performance.

  • 00:35:00 In this section, the speaker discusses the memory objects in OpenCL and how data can be copied from one location to another with functions such as copy buffer, copy image, copy buffer rect, and so on. Using a host-device system as the example, the speaker demonstrates copying data from one buffer to another with clEnqueueCopyBuffer, then mapping a buffer into host memory with clEnqueueMapBuffer and using memcpy to move data through the mapped region (see the copy-and-map sketch after this list).
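
As a hedged illustration of the clCreateKernelsInProgram/clSetKernelArg flow from the 00:05:00 segment, here is a minimal host-side sketch; `program` is assumed to be an already-built cl_program and `buf` an existing cl_mem buffer.

```c
#include <stdlib.h>
#include <CL/cl.h>

/* A sketch, not the video's exact code: create one kernel object per
   function in the program, inspect one, bind an argument, then release. */
void create_and_bind_kernels(cl_program program, cl_mem buf)
{
    cl_uint num_kernels;
    clCreateKernelsInProgram(program, 0, NULL, &num_kernels);      /* query the count first */

    cl_kernel *kernels = (cl_kernel *)malloc(num_kernels * sizeof(cl_kernel));
    clCreateKernelsInProgram(program, num_kernels, kernels, NULL); /* one object per kernel */

    char name[128];
    clGetKernelInfo(kernels[0], CL_KERNEL_FUNCTION_NAME,
                    sizeof(name), name, NULL);                     /* verify the function name */

    clSetKernelArg(kernels[0], 0, sizeof(cl_mem), &buf);           /* bind argument index 0 */

    for (cl_uint i = 0; i < num_kernels; i++)
        clReleaseKernel(kernels[i]);                               /* release after use */
    free(kernels);
}
```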
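
The image format and descriptor described at 00:15:00 and 00:20:00 come together in clCreateImage; a minimal sketch, assuming an existing context and an illustrative 640x480 RGBA layout:

```c
#include <CL/cl.h>

/* Channel order + channel type describe each pixel; the descriptor
   describes the image's type and dimensions. */
cl_mem create_rgba_image(cl_context context)
{
    cl_image_format fmt;
    fmt.image_channel_order     = CL_RGBA;       /* per-pixel channel layout: R, G, B, A */
    fmt.image_channel_data_type = CL_UNORM_INT8; /* one byte per channel */

    cl_image_desc desc = {0};                    /* zero the pitches and unused fields */
    desc.image_type   = CL_MEM_OBJECT_IMAGE2D;
    desc.image_width  = 640;                     /* in pixels */
    desc.image_height = 480;
    /* a row pitch of 0 lets the runtime derive width * pixel size */

    cl_int err;
    return clCreateImage(context, CL_MEM_READ_ONLY, &fmt, &desc, NULL, &err);
}
```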
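
And for the 00:30:00-00:35:00 segments, a sketch of the copy-then-map flow; `queue`, `src`, and `dst` are assumed to be an existing command queue and two buffers of at least `size` bytes:

```c
#include <CL/cl.h>

void copy_and_inspect(cl_command_queue queue, cl_mem src, cl_mem dst, size_t size)
{
    /* device-side copy between two memory objects */
    clEnqueueCopyBuffer(queue, src, dst, 0, 0, size, 0, NULL, NULL);

    cl_int err;
    void *host_view = clEnqueueMapBuffer(queue, dst, CL_TRUE,   /* blocking map */
                                         CL_MAP_READ | CL_MAP_WRITE,
                                         0, size, 0, NULL, NULL, &err);

    /* data can now be read or modified through host_view, e.g. with memcpy */

    clEnqueueUnmapMemObject(queue, dst, host_view, 0, NULL, NULL);
}
```
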
OpenCL Host Side Programming: Context, queues, memory objects, etc.
  • 2020.03.27
  • www.youtube.com
OpenCL Host Side Programming: Context, queues, memory objects, etc.
 

40. HDL Design Flow for FPGA

This video explains the process of developing designs for Field Programmable Gate Arrays (FPGAs) using the Quartus design software.

The design methodology and software tools for FPGA development are explained. The typical programmable logic design flow starts with a design specification, moves on to RTL coding and RTL functional simulation, and is followed by synthesis, which translates the design into device-specific primitives. Engineers then map these primitives to specific locations inside a particular FPGA and verify the performance specifications through timing analysis. Finally, the design is loaded onto an FPGA card, where debugging tools can be used to test it on hardware. For Intel FPGAs, the Quartus design software performs this flow, beginning with a system description and moving on to logic synthesis, place and route, timing and power analysis, and programming the design into the actual FPGAs.

HDL Design Flow for FPGA
  • 2020.04.18
  • www.youtube.com
(Intel) FPGA Design Flow using HDL
 

41. OpenCL data types and device memory

The video discusses OpenCL data types and device memory. It covers boolean, integer, and floating-point types and explains the data types used to operate on memory addresses: intptr_t, uintptr_t, and ptrdiff_t. It also explains vector data types, arrays containing multiple elements of the same type that allow operators to be applied to every element at the same time, and how to use them. The video provides examples of how to initialize and access elements in a vector, including letter and numerical indexing, hi/lo, and even/odd selection. It also explains memory alignment and how to use clSetKernelArg with local and private kernel arguments.

  • 00:00:00 In this section, the video provides an overview of the data types that can be used in OpenCL kernel programming, which include boolean, integer, and floating-point types. Specific data types such as intptr_t, uintptr_t, and ptrdiff_t are used to operate on memory addresses. The video notes that the double type is supported only if the targeted device supports the cl_khr_fp64 extension; developers should check for this extension before using the double type in their OpenCL kernel programs. The video also explains how to enable the double extension and use it in a kernel program (a kernel sketch appears after this list).

  • 00:05:00 In this section of the video, OpenCL data types and device memory are discussed. The OpenCL standard doesn't mandate an endian order for data types: little-endian and big-endian are the two options, depending on how a computer architecture stores multi-byte variables in memory. The endianness of a device can be queried with clGetDeviceInfo. Additionally, vector data types are introduced: fixed-length arrays that contain multiple elements of the same type and allow operators to apply to every element at the same time. Vector data types are faster and simpler to use than plain arrays, and the video explains how to use them to perform element-wise addition on multiple arrays.

  • 00:10:00 In this section, we learn about the different vector types that can be used in OpenCL, which are very similar to the scalar types but carry a number at the end indicating how many elements the vector contains. OpenCL also has two special data types, double and half, which may or may not be supported by the device. To find the preferred vector size for different types, OpenCL provides an API for querying a device's preferred vector width; based on the result, we can set options when building the program, for example using float4 when the preferred vector size is 128 bits or float8 when it is 256 bits. Vectors can be initialized by assigning initial values within parentheses, and even from smaller vectors; for instance, two vectors of size two, A and B, each initialized with scalar values, can together initialize a vector of size four.

  • 00:15:00 In this section, the speaker explains how to initialize and access elements, or components, of an OpenCL vector. Users can initialize a vector using smaller vectors, a combination of scalars and smaller vectors, or by directly assigning values to vector elements. Examples show how to use numeric indexing, letter indexing, and hi/lo and even/odd selection to access elements in the vector, and how to retrieve subsets of elements from a vector and assign them to other variables.

  • 00:20:00 In this section, the speaker discusses various methods for indexing and modifying vector elements: numerical indices, letters (x, y, z, and w) representing the different dimensions of a vector, and combinations of both. They also explain how to use hi/lo and even/odd to select a subset of vector components based on their position in the vector. These indexing and modification methods are useful for working with vectors of different lengths and dimensions (see the swizzle sketch after this list).

  • 00:25:00 In this section, we learn about further methods for accessing and modifying elements in a vector, such as hi/lo and even/odd indexing, which return the higher or lower half of the elements in a vector, or the even-indexed elements. We also explore how vectors are stored on a little-endian device, where the least significant byte of an integer is stored at a lower address than the most significant byte. In a vector of unsigned integers, the four bytes of each 32-bit element are therefore stored from least to most significant byte, with the full four-element vector occupying 16 bytes.

  • 00:30:00 In this section, the speaker discusses how OpenCL data types and device memory are stored in little-endian versus big-endian devices. They show how a four-element vector of unsigned integer type is stored in memory on both types of devices, noting that the order of the bytes is different due to the way little-endian and big-endian devices store the least significant and most significant bytes. The speaker also demonstrates a kernel function called "vector bytes" that retrieves individual bytes from this memory using pointers.

  • 00:35:00 In this section, the concept of memory alignment for OpenCL data types is discussed. Memory accesses typically align on the data size: 32-bit types such as int and float are stored at addresses that are multiples of four, 64-bit types such as long and double at multiples of eight, and in general the smallest power of two greater than or equal to the data size sets the alignment of a data structure. Additionally, the process of initializing local and private kernel arguments in OpenCL is discussed: kernel arguments in the local and private spaces are configured with clSetKernelArg, but the last parameter (the argument value) must be NULL when the argument is qualified as local. Private kernel arguments must be simple primitives, such as integers and floating-point values.

  • 00:40:00 In this section, the video discusses how to use clSetKernelArg with private kernel arguments in your OpenCL program. The call takes the index of the argument, the size of the argument, and a pointer to the variable. Private kernel arguments can also be vectors, such as a float4, which can only be used inside kernel functions and are passed in with clSetKernelArg (see the argument-binding sketch after this list).
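
First, a minimal sketch of enabling the double extension mentioned at 00:00:00; the kernel is illustrative and assumes the target device reports cl_khr_fp64:

```c
// Enable the fp64 extension before using double in the kernel.
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void scale(__global double *data)
{
    size_t gid = get_global_id(0);
    data[gid] *= 0.5;              // double arithmetic is now allowed
}
```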
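
Next, a short sketch of the initialization and indexing forms from the 00:10:00-00:25:00 segments (OpenCL C; all values illustrative):

```c
__kernel void swizzle_demo(__global float4 *out)
{
    float2 a = (float2)(1.0f, 2.0f);
    float2 b = (float2)(3.0f, 4.0f);
    float4 v = (float4)(a, b);     // build a vector from smaller vectors

    float  x  = v.x;               // letter indexing: x, y, z, w   -> 1.0f
    float  s2 = v.s2;              // numeric indexing: s0..s3      -> 3.0f
    float2 hi = v.hi;              // upper half                    -> (3, 4)
    float2 ev = v.even;            // even-indexed elements         -> (1, 3)

    v.lo = (float2)(x, s2);        // swizzles can also be assigned to
    *out = v + (float4)(hi, ev);   // use the pieces in the output
}
```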
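
Finally, a hedged sketch of the argument-setting rules from the last two segments; the hypothetical kernel is assumed to take (__global float*, __local float*, float, float4):

```c
#include <CL/cl.h>

void bind_args(cl_kernel kernel, cl_mem buf)
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);        /* global buffer */
    clSetKernelArg(kernel, 1, 256 * sizeof(float), NULL);   /* local: size only, value must be NULL */

    float scale = 2.0f;
    clSetKernelArg(kernel, 2, sizeof(float), &scale);       /* private scalar, passed by value */

    cl_float4 bias = {{1.0f, 2.0f, 3.0f, 4.0f}};
    clSetKernelArg(kernel, 3, sizeof(cl_float4), &bias);    /* private vector, passed by value */
}
```
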
OpenCL data types and device memory
  • 2020.03.31
  • www.youtube.com
Data types specific to OpenCL kernel functions and their layout in device memory.
 

42. OpenCL vector relational operations

The video discusses OpenCL kernel programming and its operators and built-in functions, focusing on relational operators and how they work with scalar and vector values. An example kernel function, "op test," performs an element-wise AND operation between a constant vector and a private vector. The video explains how to implement relational operations on vectors in OpenCL by comparing specific vector elements with a scalar using logical operations. The resulting vector can then be used in a while loop to build a final output vector that is assigned to the output memory object.

  • 00:00:00 In this section, the video introduces OpenCL kernel programming and discusses the operators and built-in functions the language inherits from other high-level languages. The operators presented include arithmetic, comparison and logic, bitwise, and ternary selection. In particular, the section focuses on relational operators and explains how they work with both scalar and vector values. The segment also provides an example kernel function called "op test," which uses a relational operator and an element-wise AND operation between a constant vector and a private vector holding initial values.

  • 00:05:00 In this section, the speaker explains how relational operations on vectors can be implemented in OpenCL. Using the example of comparing specific elements of a vector with a scalar value, the speaker shows how the comparison produces a result vector whose true and false values are represented as -1 and 0, respectively. The resulting vector can then be used in a while loop where the individual elements undergo further logical operations to create a final output vector, which is assigned to the output memory object (a small sketch follows below).
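
As a hedged illustration of how a vector comparison yields a -1/0 mask that can feed an element-wise AND, here is a minimal OpenCL C sketch; the kernel name and values are illustrative, and the video's "op test" kernel may differ:

```c
__kernel void relational_demo(__global int4 *out)
{
    int4 v    = (int4)(1, 2, 3, 4);
    int4 mask = v > (int4)(2);  // element-wise: true -> -1 (all bits set), false -> 0
                                // mask is (0, 0, -1, -1)
    *out = mask & v;            // bitwise AND keeps only the elements where v > 2
}
```
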
OpenCL vector relational operations
  • 2020.04.03
  • www.youtube.com
vector relational operations
 

43. OpenCL built-in functions: vloadn, select

The video covers two key OpenCL built-in functions: vloadn and select. vloadn initializes vectors with values from a scalar array and takes two arguments: an offset and a pointer to the scalar array. select chooses elements from two vectors to create a new vector; its mask can contain signed or unsigned integer values, and only the most significant bit of each mask element matters. The tutorial demonstrates how these functions work in practice.

  • 00:00:00 In this section, we learn about vloadn, a built-in function for initializing vectors with values from a scalar array. vloadn takes two arguments: an offset and a pointer to the scalar array. The offset, given in units of the vector size, determines which elements of the array are placed in the vector. Additionally, we learn about the select function, which picks elements from two input vectors to build a new vector. Its mask can contain signed or unsigned integer values, and only the most significant bit of each mask element matters: the most significant bit of a mask component determines which input vector supplies the corresponding element of the output vector.

  • 00:05:00 In this section, the tutorial walks through examples of both built-ins in practice: how vloadn loads elements from a scalar array into a new vector according to the offset, and how select chooses each output element from the first or second input vector based on the mask (a combined sketch follows below).
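
A hedged sketch combining the two built-ins; the kernel assumes an input array of at least eight floats, and all names are illustrative:

```c
__kernel void load_select(__global const float *in, __global float4 *out)
{
    float4 a = vload4(0, in);          // loads in[0..3]
    float4 b = vload4(1, in);          // offset is in vector-sized units: in[4..7]

    int4 mask = (int4)(-1, 0, -1, 0);  // MSB set -> take from b, else from a
    *out = select(a, b, mask);         // result: (b.x, a.y, b.z, a.w)
}
```
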
OpenCL built-in functions: vloadn, select
  • 2020.04.05
  • www.youtube.com
OpenCL built-in functions: vloadn , select
 

44. Intro to DPC++

This video introduces DPC++, a high-level language for data parallel programming that offloads complex computing to accelerators such as FPGAs and GPUs and is part of the oneAPI framework. DPC++ aims to speed up data parallel workloads using modern C++ and architecture-oriented performance optimization. The lecturer provides a simple DPC++ example that declares data management variables and executes a kernel function on a device using a command group and accessors. The video also explains how the lambda function can capture references to variables declared outside of it.

  • 00:00:00 In this section, the lecturer introduces DPC++ programming, a high-level language for data parallel programming that offloads complex computing to accelerators such as FPGAs and GPUs. DPC++ uses modern C++ and aims to speed up data parallel workloads by analyzing algorithms, decomposing tasks or data, and applying architecture-oriented performance optimization. DPC++ is part of the oneAPI framework, and its goal is to enable programming in a single language whose code can execute on CPUs, FPGAs, or GPUs. The lecturer then provides a simple DPC++ example that declares variables, a buffer, and a device queue for data management.

  • 00:05:00 In this section, the speaker walks through a DPC++ program that creates a command group with a lambda function defining the kernel to execute on the device. The program uses one accessor to associate a buffer with the command group and another accessor to read the result. Finally, a for loop uses the result accessor to access the contents of the buffer and print them out. The lambda function can take arguments in different ways, such as capturing references to the variables declared outside of the function (a sketch along these lines follows below).
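
A minimal DPC++ sketch along the lines described; sizes and names are illustrative, not the lecture's exact code:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    constexpr int N = 8;
    int data[N] = {};

    sycl::queue q;                                    // device queue
    {
        sycl::buffer<int, 1> buf(data, sycl::range<1>(N));
        q.submit([&](sycl::handler &h) {              // command group
            sycl::accessor acc(buf, h, sycl::write_only);
            h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
                acc[i] = static_cast<int>(i[0]) * 2;  // kernel body, one work item per index
            });
        });
    }                                                 // buffer leaves scope: results sync back

    for (int i = 0; i < N; ++i)
        std::cout << data[i] << ' ';                  // host-side read of the result
    std::cout << '\n';
}
```
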
Intro to DPC++
  • 2021.04.07
  • www.youtube.com
This video gives a brief introduction to DPC++ and goes through a simple DPC++ example source code.
 

45. How to Think In Parallel?

The video teaches parallel programming using matrix multiplication as an example. It highlights the parallelism in this computation, where multiple rows and columns can be calculated independently. The calculation of a single element of matrix C is implemented as a kernel function that allows for parallel computation. The use of accessors, ranges, and parallel kernel functions is explained in detail, along with the steps involved in passing the range value into the kernel function. A demo of matrix multiplication on the Intel FPGA dev cloud is also shown.

  • 00:00:00 In this section, the video introduces matrix multiplication as a commonly used example to teach parallel programming. The video explains that matrix multiplication involves taking rows from one matrix and columns from another to perform element-wise multiplication and accumulation to produce a resulting matrix. The video explains that there is a lot of parallelism in this computation, as different rows and columns can be calculated independently from each other. A simple implementation of matrix multiplication is shown using regular C or C++ language with nested for loops performing the element-wise multiplication and accumulation.

  • 00:05:00 In this section, we learn how the calculation of a single element of matrix C is implemented as one kernel function that allows for parallel computation. The key point is that for every row and column the computation is the same, the only difference being the row and column numbers. Accessors provide access to the buffers in kernels, with read-only access for matrices A and B and write access for matrix C. A range is used as an abstraction to declare multiple dimensions, and h.parallel_for defines a parallel kernel function. The kernel function includes a lambda function whose argument is the index variable that iterates through all the values in both dimensions.

  • 00:10:00 In this section, the speaker explains the steps involved in passing the range value into the kernel function, which is a lambda function. They discuss the two dimensions of the index variable and how it identifies each work item. The speaker walks through how the lambda function works and shows how the problem size is defined by the number of rows and columns over which the kernel function executes. They use the matrix multiplication example in traditional C++ notation, with the element-wise multiplication and accumulation done in the innermost for loop. Finally, they give a quick demo of matrix multiplication on the Intel FPGA dev cloud (a sketch of the parallel kernel follows below).
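
A hedged sketch of such a parallel matrix multiply in DPC++; buffer shapes and names are assumptions, not the lecture's exact code:

```cpp
#include <sycl/sycl.hpp>

// C = A * B with one work item per output element; A is M x K, B is K x N.
void matmul(sycl::queue &q,
            sycl::buffer<float, 2> &A,
            sycl::buffer<float, 2> &B,
            sycl::buffer<float, 2> &C,
            size_t M, size_t K, size_t N) {
    q.submit([&](sycl::handler &h) {
        sycl::accessor a(A, h, sycl::read_only);   // read-only inputs
        sycl::accessor b(B, h, sycl::read_only);
        sycl::accessor c(C, h, sycl::write_only);  // write access for the result
        // The 2D range is the problem size: rows x columns of C.
        h.parallel_for(sycl::range<2>(M, N), [=](sycl::id<2> idx) {
            size_t row = idx[0], col = idx[1];
            float sum = 0.0f;
            for (size_t k = 0; k < K; ++k)         // innermost multiply-accumulate
                sum += a[row][k] * b[k][col];
            c[row][col] = sum;
        });
    });
}
```
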
How to Think In Parallel?
  • 2021.04.07
  • www.youtube.com
This video uses the matrix multiplication example to introduce how to think in parallel when designing data parallel algorithms.