Organizing parallel computing in the GPT model
We continue working on our GPT model class, CNeuronGPT. In the previous sections, we have already recreated the model algorithm using standard MQL5 tools. Now it is time to supplement the model with the ability to perform mathematical operations in multi-threaded mode using the computing power of the GPU. This capability is provided by OpenCL.
To organize this process, we have to perform two subtasks:
- Create an OpenCL program.
- Organize the call of the OpenCL program from the main program.
Let's start by creating the executable program on the OpenCL side. In this program, we need to implement the part of the algorithm that is not covered by calls to the methods of internal objects. There are two such blocks: one in the feed-forward pass, and a second, mirroring the first, in the error gradient propagation method of the backpropagation pass.
To execute the feed-forward algorithm, we will create the GPTFeedForward kernel. In part, the kernel algorithm will resemble a similar kernel for classes using attention mechanisms. This is not surprising since they all use the Self-Attention mechanism. However, each implementation has its nuances. Last time, instead of creating a new kernel for organizing multi-head attention, we were able to quickly modify the existing kernel of the Self-Attention algorithm. Now, creating a new kernel seems less costly compared to trying to create a universal kernel for all tasks.
Unlike the implementation of the Multi-Head Self-Attention mechanism, in which we moved the kernel into a two-dimensional task space, in this implementation we return to a one-dimensional space. The reason is that we cannot split the task into parallel threads across the elements of the Query sequence: the GPT implementation processes only one query per iteration. Hence, the only dimension left for thread-level parallelism is the attention heads.
In the parameters of the GPTFeedForward kernel, we will continue to pass pointers to five data buffers. However, the number of scalar variables increases: earlier we obtained the sequence length from the dimension of the task space, but now we have to specify it explicitly in the kernel parameters. In addition, one more parameter specifies the position of the current element in the Key and Value sequences.
__kernel void GPTFeedForward(__global TYPE *querys,
|
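Judging by the description above, the complete signature could look like the following sketch (the names keys, scores, values, and outputs and the exact parameter order are assumptions; TYPE is the precision macro — float or double — used throughout the book):
__kernel void GPTFeedForward(__global TYPE *querys,  // concatenated Query, Key, Value of the current element
                             __global TYPE *keys,    // stack of Key vectors of the entire sequence
                             __global TYPE *scores,  // vector of dependency coefficients
                             __global TYPE *values,  // stack of Value vectors of the entire sequence
                             __global TYPE *outputs, // results of the Self-Attention block
                             int key_size,           // size of the description vector of one element
                             int units,              // number of elements in the sequence
                             int current)            // position of the current element in the Key/Value stacks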
As mentioned earlier, the created kernel will operate in a one-dimensional task space over the attention heads. Therefore, the first thing we do in the body of the kernel is determine the active attention head from the identifier of the executing thread; the total number of launched threads equals the number of attention heads.
const int h = get_global_id(0);
|
We immediately determine the offset in the Query and Score (dependency coefficient matrix) tensors.
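For instance, these opening steps might be sketched as follows (the variable names shift_query and shift_scores are assumptions; the number of heads is taken from the size of the global task space):
const int h = get_global_id(0);          // attention head served by this thread
const int heads = get_global_size(0);    // total number of attention heads
int shift_query = key_size * h;          // start of this head's Query vector in the querys buffer
int shift_scores = units * h;            // start of this head's block of dependency coefficients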
Next, according to the Self-Attention algorithm that is being built, we determine the dependency coefficients between the elements of the sequence. To do this, we multiply the Query vector by the Key tensor. To implement these operations, we will create a system of two nested loops. The outer loop will iterate over the elements of the Key tensor sequence and, accordingly, the elements of the Score vector with dependency coefficients. In the body of the loop, we will define the offset in the Key tensor and prepare a local variable to count the intermediate values.
After that, we organize a nested loop with the number of iterations equal to the size of the description vector of one sequence element. In the body of this loop, we multiply the pair of vectors element by element and accumulate the products. The resulting value is divided by the square root of the vector dimension, and the exponential function is applied to it. We write the result into the corresponding element of the dependency coefficient vector and add it to the running sum of all elements of that vector for subsequent normalization.
Here we should take into account the concatenated results buffer of the Querys layer: the key and value vectors of the current sequence element have not yet been transferred to the corresponding stack buffers. Therefore, we check which element of the key tensor we are accessing, and for the current element we first copy its data into the buffer. Of course, for the current operations we could take the data straight from the querys buffer, but we will need it in subsequent iterations, so transferring it to the stack buffer is unavoidable.
TYPE summ = 0;
|
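Assembled together, the loop system might look like this sketch. The layout of the keys stack (one key_size vector per head for every sequence element) and the position of the Key block inside the concatenated querys buffer are assumptions here, not the book's verbatim code:
TYPE summ = 0;
for(int s = 0; s < units; s++)
  {
   int shift_key = (s * heads + h) * key_size;            // key vector of element s for this head
   //--- copy the current element's key from the concatenated querys buffer into the keys stack
   if(s == current)
      for(int k = 0; k < key_size; k++)
         keys[shift_key + k] = querys[(heads + h) * key_size + k];
   //--- dot product of Query and Key, scaled and passed through the exponent
   TYPE score = 0;
   for(int k = 0; k < key_size; k++)
      score += querys[shift_query + k] * keys[shift_key + k];
   score = exp(score / sqrt((TYPE)key_size));
   scores[shift_scores + s] = score;
   summ += score;
  }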
As a result of performing a full cycle of iterations of the two-loop system created above, we get a vector of dependency coefficients. According to the Self-Attention algorithm, before further use these coefficients must be normalized by the Softmax function. By taking the exponential of the vector products, we have already performed part of that function's computation. To complete the normalization, we only need to divide the values stored in the vector by their total sum, which we prudently accumulated in the local variable summ. Therefore, we organize another loop with the number of iterations equal to the size of the dependency coefficient vector, and in its body we divide every element of the vector by the value of the local variable summ.
for(int s = 0; s < units; s++)
|
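The normalization loop itself is a one-liner over the head's block of coefficients, for example:
for(int s = 0; s < units; s++)
   scores[shift_scores + s] /= summ;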
Thus, after completing the iterations of the loop, we get the normalized values of the dependency coefficients in the Score vector, with the total sum of all elements of the block equal to one. In essence, the obtained coefficients indicate the share of influence of each sequence element from the Value tensor on the final value of the analyzed sequence element in the results tensor of the current attention head.
This means that in order to obtain the final values, we need to multiply the Score vector of normalized dependency coefficients by the Value tensor. To perform this operation, we need another system of two nested loops. But before running it, we determine the offset in the results tensor to the beginning of the vector of the analyzed sequence element.
The outer loop, with the number of iterations equal to the size of the vector describing one element of the sequence, indicates the ordinal number of the collected element in the result vector. The nested loop, with the number of iterations equal to the number of elements in the sequence, helps correlate the vectors of the Value tensor with the dependency coefficients from the Score vector. In the body of the nested loop, we multiply the vector from the Value tensor by the corresponding element dependency coefficient. The resulting values of the products will be accumulated in a local variable. After completing the iterations of the inner loop, we save the obtained value in the buffer of the Self-Attention block results.
shift_query = key_size * h;
|
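A sketch of this final block is shown below. A separate shift_out variable is used here for readability, while the actual code may reuse an existing offset variable; the copying of the current Value vector into the stack and the values layout mirror the assumptions made above for the keys:
//--- copy the current element's Value vector from the concatenated querys buffer into the values stack
for(int k = 0; k < key_size; k++)
   values[(current * heads + h) * key_size + k] = querys[(2 * heads + h) * key_size + k];
//--- multiply the normalized coefficients by the Value tensor
int shift_out = key_size * h;                              // this head's block in the results tensor
for(int i = 0; i < key_size; i++)
  {
   TYPE out = 0;
   for(int s = 0; s < units; s++)
      out += values[(s * heads + h) * key_size + i] * scores[shift_scores + s];
   outputs[shift_out + i] = out;
  }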
As a result of completing the full cycle of iterations within the loop system, we obtain the vector describing one element of the sequence in the tensor of results for one attention head. The task assigned to this kernel has been completed, and we can exit it.
This concludes the work with the feed-forward kernel, and we move further along. Now we need to organize the backpropagation process. The implementation of this task will be split into two kernels. In the GPTCalcScoreGradient kernel, we will propagate the error gradient to the vector of the dependency coefficients. In the GPTCalcHiddenGradient kernel, we will continue the propagation of the error gradient up to the level of the Query and Key tensors.
Let's take it step by step. The GPTCalcScoreGradient kernel in the parameters receives pointers to six data buffers and three parameters:
- scores buffer for the vector of dependency coefficients
- scores_grad buffer for the error gradient vector at the level of dependency coefficients
- values buffer for the Value tensor
- values_grad buffer for the error gradient tensor at the Value level
- outputs_grad error gradient tensor buffer at the result level of the Self-Attention block
- scores_temp buffer for writing intermediate values
- window size of the vector describing one element of the sequence in the Value tensor
- units number of elements in the sequence
- current ordinal number of the current item in the Value stack
__kernel void GPTCalcScoreGradient(__global TYPE *scores,
|
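Collected into a signature, these parameters might look as follows (the parameter order is an assumption; the names repeat the list above):
__kernel void GPTCalcScoreGradient(__global TYPE *scores,        // vector of dependency coefficients
                                   __global TYPE *scores_grad,   // error gradients at the level of the coefficients
                                   __global TYPE *values,        // stack of Value vectors
                                   __global TYPE *values_grad,   // error gradients at the Value level
                                   __global TYPE *outputs_grad,  // error gradients at the Self-Attention output level
                                   __global TYPE *scores_temp,   // buffer for intermediate values
                                   int window,                   // size of one element's vector in the Value tensor
                                   int units,                    // number of elements in the sequence
                                   int current)                  // position of the current element in the Value stack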
As with the feed-forward pass, we run the kernel in a one-dimensional task space sized by the number of attention heads used. In the body of the kernel, we immediately determine the active attention head based on the thread identifier; the total number of launched threads equals the number of attention heads.
const int h = get_global_id(0);
|
We also determine the offset in the tensors of the error gradients at the level of Value and in the vector of the dependency coefficients. Note that the offset in the tensor of error gradients for Value and in the Value tensor itself will be different in this case.
In this implementation of the GPT model, we used one internal neural layer to generate a concatenated tensor containing the values of Query, Key, and Value for all attention heads. Accordingly, we assemble the error gradient into a similar concatenated tensor of the error gradients for the specified neural layer. However, this tensor contains only the current element of the sequence. At the same time, the Value stack tensor contains complete information about the entire sequence but only for the Value tensor.
After the preparatory work, we distribute the error gradient to the Value tensor. As mentioned above, we distribute the error gradient only for the current element of the sequence. To do this, we organize a loop with the number of iterations equal to the size of the description vector of one sequence element in the Value tensor. In the body of the loop, we multiply the error gradient vector at the level of the Self-Attention block results by the corresponding dependency coefficient from the Score vector. The obtained values are saved in the buffer of the concatenated error gradient tensor.
//--- Gradient distribution to Values
|
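This step might be sketched as follows; the offsets assume that the concatenated gradient buffer stores the Query, Key, and Value blocks of all heads one after another, which is an assumption of the sketch:
//--- Gradient distribution to Values
const int heads = get_global_size(0);
int shift_score = units * h + current;          // coefficient of the current element for this head
int shift_value = (2 * heads + h) * window;     // Value block of this head in the concatenated gradient buffer
int shift_out = window * h;                     // this head's block of the output-level gradients
for(int i = 0; i < window; i++)
   values_grad[shift_value + i] = outputs_grad[shift_out + i] * scores[shift_score];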
After calculating the error gradient on the Value tensor, we determine the error gradient at the level of the dependency coefficient vector. This operation requires a system of two loops: an outer loop with the number of iterations equal to the number of sequence elements and a nested loop with the number of iterations equal to the size of the description vector of one element in the Value tensor. In the body of the nested loop, we multiply the two vectors (Value and error gradient). The resulting value is stored in the temporary data buffer.
//--- Gradient distribution to Score
|
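A possible sketch of this loop system (heads and shift_out as in the previous sketch; the values stack layout is assumed to match the feed-forward kernel):
//--- Gradient distribution to Score
for(int s = 0; s < units; s++)
  {
   TYPE grad = 0;
   for(int i = 0; i < window; i++)
      grad += values[(s * heads + h) * window + i] * outputs_grad[shift_out + i];
   scores_temp[units * h + s] = grad;
  }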
After completing the full cycle of iterations of the loop system, the temporary buffer contains a fully populated error gradient vector for the dependency coefficients. However, to propagate the error gradient further, we first need to adjust it by the derivative of the Softmax function.
Let's organize another system of two nested loops, both with the number of iterations equal to the number of sequence elements. In the body of the nested loop, we calculate the derivative according to the Softmax Jacobian: the partial derivative of the j-th normalized coefficient with respect to the i-th pre-normalization value equals Sj * (δij − Si), where δij is 1 when i = j and 0 otherwise. The adjusted gradient of each element is therefore a sum over the entire coefficient vector.
//--- Adjust to the Softmax derivative
|
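With that formula, the adjustment might be sketched as follows (buffer offsets as assumed earlier):
//--- Adjust to the Softmax derivative
for(int i = 0; i < units; i++)
  {
   TYPE grad = 0;
   for(int j = 0; j < units; j++)
      grad += scores_temp[units * h + j] * scores[units * h + j] *
              ((TYPE)(i == j) - scores[units * h + i]);
   scores_grad[units * h + i] = grad;
  }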
We will save the obtained values in the error gradient buffer at the level of the dependency coefficient vector. At this stage, we complete the work of the first kernel and move on to the second one.
In the second backpropagation kernel, GPTCalcHiddenGradient, we have to propagate the error gradient further, down to the level of the Query and Key tensors.
In parameters, the GPTCalcHiddenGradient kernel receives pointers to 4 data buffers and 3 parameters.
__kernel void GPTCalcHiddenGradient(__global TYPE *querys,
|
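Given the description, the signature might be sketched as shown below; the set and order of the buffers (querys, querys_grad, keys, scores_grad) are assumptions consistent with the text:
__kernel void GPTCalcHiddenGradient(__global TYPE *querys,       // concatenated Query, Key, Value of the current element
                                    __global TYPE *querys_grad,  // concatenated error gradient buffer (Query, Key, Value)
                                    __global TYPE *keys,         // stack of Key vectors of the entire sequence
                                    __global TYPE *scores_grad,  // error gradients of the dependency coefficients
                                    int key_size,                // size of the description vector of one element
                                    int units,                   // number of elements in the sequence
                                    int current)                 // position of the current element in the Key stack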
Note that we talked about distributing the gradient into the Query and Key tensors. In the kernel parameters, there is a pointer only to the Query error gradient buffer. This situation is made possible by the use of a concatenated buffer, in which we have already saved the error gradient at the level of the Value tensor. Now we add error gradients at the level of the Query and Key tensors to the same buffer.
In the kernel body, we determine the ordinal number of the analyzed attention head based on the thread identifier and the number of used attention heads considering the total number of tasks launched.
const int h = get_global_id(0);
|
Here we also define the offsets in the data buffers before the beginning of the analyzed vectors.
Next, we organize a system of two nested loops, in the body of which we determine the error gradients at the level of the required tensors. To do this, we multiply the error gradient at the level of the dependency coefficient vector by the opposite tensor: by Key to obtain the Query gradient, and by Query to obtain the Key gradient.
//--- Gradient distribution on Querys and Keys
|
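The loop system could be sketched as follows. The buffer layouts and the division by the square root of the vector dimension (the counterpart of the scaling applied in the feed-forward pass) are assumptions of the sketch:
//--- Gradient distribution to Querys and Keys
const int heads = get_global_size(0);
int shift_query = key_size * h;                 // Query block of this head in the concatenated buffers
int shift_key = (heads + h) * key_size;         // Key block of this head in the concatenated buffers
for(int i = 0; i < key_size; i++)
  {
   //--- Query gradient: sum over the keys of the entire sequence
   TYPE grad_q = 0;
   for(int s = 0; s < units; s++)
      grad_q += scores_grad[units * h + s] * keys[(s * heads + h) * key_size + i];
   querys_grad[shift_query + i] = grad_q / sqrt((TYPE)key_size);
   //--- Key gradient: only the current element contributes, through its own coefficient
   querys_grad[shift_key + i] = scores_grad[units * h + current] *
                                querys[shift_query + i] / sqrt((TYPE)key_size);
  }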
Note that we calculate the error gradient only for the current element of the sequence and save the obtained values in the corresponding elements of the error gradient buffer.
As a result of all iterations of our loop system, we get a fully filled concatenated tensor of error gradients of all three entities (Query, Key, and Value). We complete the work on building the OpenCL program and move on to building the functionality on the side of the main program.
To make it more convenient to manage the constructed kernels in the main program, let's create named constants for calling kernels and accessing their elements. To do this, we open our constants file defines.mqh and create kernel access constants in it.
#define def_k_GPTFeedForward 34
|
Then we add access constants to kernel parameters.
//--- GPT feed-forward pass
|
//--- determine the gradient at the matrix of GPT dependency coefficients
|
//--- gradient distribution via GPT
|
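For illustration, the parameter constants of the feed-forward kernel might look like this; the names and index values are hypothetical and only need to match the argument order in the kernel signature:
#define def_gptff_querys    0
#define def_gptff_keys      1
#define def_gptff_scores    2
#define def_gptff_values    3
#define def_gptff_outputs   4
#define def_gptff_key_size  5
#define def_gptff_units     6
#define def_gptff_current   7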
After that, we go to the CNet neural network dispatcher class and, in the OpenCL initialization method InitOpenCL, change the total number of kernels in our program. Next, we create the new kernels in the OpenCL context.
bool CNet::InitOpenCL(void)
|
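The additions follow the same SetKernelsCount/KernelCreate pattern used for the earlier kernels; in the sketch below the new total and the constant def_k_GPTScoreGradients are illustrative assumptions:
if(!m_cOpenCL.SetKernelsCount(37))              // illustrative total after adding the three new kernels
   return false;
if(!m_cOpenCL.KernelCreate(def_k_GPTFeedForward, "GPTFeedForward"))
   return false;
if(!m_cOpenCL.KernelCreate(def_k_GPTScoreGradients, "GPTCalcScoreGradient"))
   return false;
if(!m_cOpenCL.KernelCreate(def_k_GPTHiddenGradients, "GPTCalcHiddenGradient"))
   return false;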
This concludes the preparatory work, and we proceed directly to the methods of our CNeuronGPT class. In them, for each kernel call, we have to perform three stages of work:
- Preparing the input data and transferring it to the memory of the OpenCL context.
- Placing the kernel in the execution queue.
- Loading the results of program execution into the memory of the main program.
First, we modify the CNeuronGPT::FeedForward method. In the block for organizing multi-threaded computing using OpenCL, we first check for the presence of an already created buffer in the memory of the OpenCL context.
bool CNeuronGPT::FeedForward(CNeuronBase *prevLayer)
|
When all buffers have been created, and those that are necessary for kernel operation have been passed to the OpenCL context memory, we pass pointers to the used data buffers and the necessary constants to the kernel parameters.
//--- pass parameters to the kernel
|
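The parameter block follows the familiar SetArgumentBuffer/SetArgument pattern. In the sketch below the def_gptff_* constants are the hypothetical ones shown earlier, and the member variables holding the vector size, sequence length, and current position are assumptions:
//--- pass parameters to the kernel (sketch)
if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTFeedForward, def_gptff_querys,
                                Querys.GetOutputs().GetIndex()))
   return false;
//--- ... similar SetArgumentBuffer calls for the keys, scores, values, and outputs buffers ...
if(!m_cOpenCL.SetArgument(def_k_GPTFeedForward, def_gptff_key_size, m_iKeysSize))
   return false;
if(!m_cOpenCL.SetArgument(def_k_GPTFeedForward, def_gptff_units, m_iUnits))
   return false;
if(!m_cOpenCL.SetArgument(def_k_GPTFeedForward, def_gptff_current, m_iCurrentPosition))
   return false;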
At this stage, the preparatory work is completed, and we move on to placing the kernel in the execution queue. Here, we first fill two dynamic arrays in which we specify the offset and the number of threads to launch in each dimension of the task space. Then we call the m_cOpenCL.Execute method, which places the kernel in the queue.
//--- Place a kernel in the queue
|
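For a one-dimensional task space over the attention heads, queuing the kernel could look like this (the member name m_iHeads is an assumption; the array names match those used later for the backpropagation kernels):
int off_set[] = {0};                  // no offset in the task space
int NDRange[] = {m_iHeads};           // one thread per attention head
if(!m_cOpenCL.Execute(def_k_GPTFeedForward, 1, off_set, NDRange))
   return false;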
This concludes the CNeuronGPT::FeedForward method. But we still have to do similar work in the CNeuronGPT::CalcHiddenGradient method of the backpropagation algorithm.
Let me remind you that in order to implement the backpropagation method, we have created two kernels that will be called sequentially one after the other. Therefore, the kernel maintenance work must be repeated for each of them.
First, let's create data buffers for the first kernel.
bool CNeuronGPT::CalcHiddenGradient(CNeuronBase *prevLayer)
|
Following our algorithm for working with the OpenCL context, after creating data buffers and passing all the necessary information to the context memory, we pass pointers to the used data buffers and constants for executing the program algorithm to the parameters of the kernel being launched.
//--- pass parameters to the kernel
|
Note that instead of the Value tensor error gradient buffer, we pass a pointer to the gradient buffer of the inner neural layer Querys. This is because we used a concatenated error gradient buffer for all three tensors. To eliminate the subsequent data copy operation, we will immediately write the data to the concatenated buffer.
After that, we perform the operation of placing the kernel in the queue. Let me remind you that we are launching a kernel to perform tasks in one-dimensional space in the context of attention heads.
Let's specify the offset in the task space and the number of threads to be started in the corresponding dynamic arrays. After that, we call the method of queuing our kernel.
//--- Place the kernel in queue
|
This concludes the work on the first kernel, and we move on to building a similar algorithm for the second kernel of the backpropagation pass.
Now we will check the additional buffers in the memory of the OpenCL context.
if(Querys.GetOutputs().GetIndex() < 0)
|
We pass the parameters to the kernel.
if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTHiddenGradients,
|
After that, we will put the kernel in the execution queue. Note that this time we do not create arrays of offset and dimensionality of the task space. We simply use the arrays created during the execution of the previous kernel without modification.
if(!m_cOpenCL.Execute(def_k_GPTHiddenGradients, 1, off_set, NDRange))
|
This completes the work on building a GPT model class, and we can proceed to evaluate the results of the work done.