5. GPT backpropagation methods

In the previous sections, we looked at the architecture of the GPT model and implemented the methods that initialize our new class and perform the feed-forward pass through the model. Now let's look at a possible implementation of the backpropagation pass for this algorithm.

To implement the backpropagation pass, in each new class we override three methods (a declaration sketch follows the list):

  • CalcHiddenGradient — method for calculating the error gradient through the hidden layer
  • CalcDeltaWeights — method for calculating the error gradient to the level of the weight matrix
  • UpdateWeights — method for updating weights
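
Schematically, the overrides might look like this in the class declaration. This is a minimal sketch, assuming the CNeuronGPT declaration from the previous section; the ellipsis stands for the members already described there, and the signatures match the method bodies shown later in this section.

class CNeuronGPT : public CNeuronBase
  {
   ...

public:
   //--- error gradient propagation through the hidden layer
   virtual bool      CalcHiddenGradient(CNeuronBase *prevLayer);
   //--- error gradient propagation to the level of the weight matrices
   virtual bool      CalcDeltaWeights(CNeuronBase *prevLayer, bool read);
   //--- updating the weight matrices
   virtual bool      UpdateWeights(int batch_size, TYPE learningRate,
                                   VECTOR &Beta, VECTOR &Lambda);
  };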

This class will not be an exception, and we will redefine all three methods. Let's start with the first and probably the most complex backpropagation method: propagating the error gradient through the hidden layer. It is in this method that we have to repeat the entire feed-forward algorithm in reverse order.

In the parameters, the method receives a pointer to the object of the previous layer, to which we have to pass the error gradient. Again, in the body of the method, we implement a block of checks. In it, according to the already established good tradition, we check the validity of pointers to all objects used in the method. This approach helps eliminate many critical errors during the execution of the method code.

bool CNeuronGPT::CalcHiddenGradient(CNeuronBase *prevLayer)
  {
//--- check the relevance of all objects
   if(!m_cOutputs || !m_cGradients ||
      m_cOutputs.Total() != m_cGradients.Total())
      return false;

Next, by analogy with the feed-forward method, we organize a loop to iterate through the internal neural layers. This time, however, in keeping with the logic of the backward pass, the loop counts its iterations down. All further operations are performed in the body of this loop and are repeated for every nested layer of our model.

//--- run a loop through all internal layers in reverse order
   for(int layer = m_iLayers - 1; layer >= 0; layer--)
     {
      CNeuronBase *FF2 = m_cFF2.At(layer);
      if(!FF2)
         return false;
      CBufferType *Gradients = FF2.GetGradients();
      //--- scale the gradient for normalization
      if(!NormlizeBufferGradient(FF2.GetOutputs(), Gradients,
                                               GetPointer(m_dStd[layer]), 1))
         return false;

In the body of the loop, we first retrieve a pointer to FF2, the neural layer holding the output of the Feed Forward block, and adjust its error gradient buffer for the derivative of the normalization function. We discussed the reasons for this operation in detail when constructing a similar method for the Self-Attention algorithm.

After this, we sequentially call the error gradient distribution methods for the internal layers of the Feed Forward block. We also call the methods in the reverse order: first for the second layer, and then for the first one.

      //--- propagate a gradient through the Feed Forward block
      CNeuronBase *FF1 = m_cFF1.At(layer);
      if(!FF2.CalcHiddenGradient(FF1))
         return false;
      CNeuronBase *W0 = m_cW0.At(layer);
      if(!FF1.CalcHiddenGradient(W0))
         return false;

During the feed-forward pass, we added up the results of the Multi-Heads Self-Attention and Feed Forward blocks. Accordingly, we now need to propagate the error gradient in two directions. We add up the error gradients at the output level of the specified blocks and then adjust the total tensor for the derivative of the layer normalization function.

      CBufferType *attention_grad = W0.GetGradients();
      if(!attention_grad.SumArray(Gradients))
         return false;
      //--- scale the gradient for normalization
      if(!NormlizeBufferGradient(W0.GetOutputs(), attention_grad,
                                            GetPointer(m_dStd[layer]), 0))
         return false;

Next, we distribute the error gradient across the attention heads by calling the error gradient distribution method of the internal neural layer W0.

      //--- initialize Scores
      CNeuronBase *Scores = m_cScores.At(layer);
      if(!Scores)
         return false;
      //--- distribute the error gradient across the heads of attention
      CNeuronBase *AttentionOut = m_cAttentionOut.At(layer);
      if(!W0.CalcHiddenGradient(AttentionOut))
         return false;

Until now, everything was simple and transparent: we just called the corresponding methods of our internal neural layers in reverse order. But next comes the part of the algorithm that is not covered by the methods of the internal neural layers; it was implemented directly inside the feed-forward method. Therefore, here we also have to recreate the error gradient backpropagation functionality in full.

First, let's do the preparatory work and create local pointers to Querys, Keys, and Values objects. At this point, don’t forget to check the validity of the received object pointers.

      //--- get pointers to Querys, Keys, Values objects
      CNeuronBase *Querys = m_cQuerys.At(layer);
      if(!Querys)
         return false;
      CNeuronBase *Keys = m_cKeys.At(layer);
      if(!Keys)
         return false;
      CNeuronBase *Values = m_cValues.At(layer);
      if(!Values)
         return false;

Next, we need two variants of the algorithm implementation: one using standard MQL5 tools and one for multi-threaded operations using OpenCL technology. We create a branch in the algorithm depending on the device selected for performing the mathematical operations. As usual, in this section we will look at the implementation using standard MQL5 tools and will return to the multi-threaded block in other sections.

To organize the calculations using standard MQL5 tools, we prepare dynamic arrays. Into one array we load the error gradient data from the buffer; other arrays are filled with the results of the feed-forward pass, and the rest are initialized with zeros for the subsequent accumulation of error gradients.

      //--- branching of the algorithm across the computing device
      attention_grad = AttentionOut.GetGradients();
      if(!m_cOpenCL)
        {
         MATRIX gradients[];
         if(!attention_grad.m_mMatrix.Vsplit(m_iHeads, gradients))
            return false;
         if(!Querys.GetGradients().m_mMatrix.Reshape(3, m_iHeads * m_iKeysSize))
            return false;
         MATRIX values[];
         if(!Values.GetOutputs().m_mMatrix.Vsplit(m_iHeads, values))
            return false;
         MATRIX keys[];
         if(!Keys.GetOutputs().m_mMatrix.Vsplit(m_iHeads, keys))
            return false;
         MATRIX querys[];
         MATRIX query = Querys.GetOutputs().m_mMatrix;
         if(!query.Reshape(3, m_iHeads * m_iKeysSize) ||
            !query.Resize(1, query.Cols()))
            return false;
         if(!query.Vsplit(m_iHeads, querys))
            return false;
         MATRIX querys_grad = MATRIX::Zeros(m_iHeads, m_iKeysSize);
         MATRIX keys_grad = querys_grad;
         MATRIX values_grad = querys_grad;

First, we will distribute the error gradient to the Value tensor. It's important to note that we'll be distributing the error gradient not across the entire tensor but only for the current element. This is reasonable when we consider the purpose of error gradient distribution. We aim to optimize the model parameters throughout the training process, and distributing the error gradient helps us obtain guidelines for this optimization.

When distributing the error gradient to the Value tensor, we need to pass it in two directions: to the previous layer and to the weight matrix responsible for forming the current layer's tensor.

We can only transfer the error gradient for the current state to the previous layer. The buffer of the previous layer is unable to accept more because, during the feed-forward pass, it only provides the current state for which it expects the error gradient.

Also, only the current state error gradient can be propagated to the weight matrix. To distribute the error from previous states, we would need the input data from those previous states. However, the previous layer does not provide this information, and we did not save it in the buffers of our layer.

Therefore, distributing the gradient to the elements of the value tensor, except for the current state, is a dead-end task and does not make sense.

The general approach is as follows: during the feed-forward pass, we calculate only the current state and additionally retrieve from memory those already calculated in previous iterations. A similar situation applies during the backpropagation pass: it is assumed that the error gradient from previous states has already been considered in the backpropagation methods in previous iterations. This significantly reduces the number of operations for each iteration of the feed-forward and backpropagation passes.
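
As a purely illustrative sketch (the names and sizes here are hypothetical and are not part of CNeuronGPT), the idea can be expressed as follows: the Key and Value buffers act as a cache of m_iUnits positions, only one row of which is recalculated per pass, so only that same row can receive an error gradient on the backpropagation pass.

//--- purely illustrative sketch: hypothetical names, not the library code
void CacheSketch(void)
  {
   int    units            = 4;             // length of the cached sequence
   int    key_size         = 8;             // size of one Key vector
   int    current_position = 2;             // the only position updated on this pass
   matrix keys_cache       = matrix::Zeros(units, key_size);
//--- feed-forward pass: recalculate and store only the current row of the cache
   vector new_key = vector::Ones(key_size);
   keys_cache.Row(new_key, current_position);
//--- backpropagation pass: only that same row can receive an error gradient,
//--- since the inputs that produced the other rows are no longer available
   vector key_grad = vector::Zeros(key_size);
  }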

I hope the logic is clear. Let's return to our backpropagation method. We paused at passing the error gradient to the Value tensor. To execute this iteration, we will first create a local pointer to the attention coefficient vector and then organize a loop.

Our loop will iterate through the active attention heads. Here, we immediately save the attention coefficient vector corresponding to the analyzed attention head in a local matrix. We multiply the gradient vector obtained from previous iterations by the attention coefficient for the current element of the sequence. The resulting values are saved in the error gradient matrix in the Values buffer.

         for(int head = 0; head < m_iHeads; head++)
           {
            MATRIX score = MATRIX::Zeros(1, m_iUnits);
            if(!score.Row(Scores.GetOutputs().m_mMatrix.Row(head), 0))
               return false;
            //--- distribution of the gradient on Values
            if(!values_grad.Row((gradients[head] * 
                                     score[0, m_iCurrentPosition]).Row(0), head))
               return false;

Next, we need to distribute the gradient in the second direction: through the matrix of dependency coefficients onto the Query and Key tensors. But first, we need to propagate the gradient through the vector of dependency coefficients. We multiply the error gradient matrix at the output of the attention block by the transposed Values matrix and obtain the gradient at the level of the vector of dependency coefficients.

So, we have a vector of error gradients for one attention head. But I would like to remind you that during the feed-forward pass, we normalized the vector of dependence coefficients with the Softmax function. Therefore, the obtained error gradients are valid for normalized data. To further distribute the error gradients, we need to adjust the error gradients to the derivative of the specified function.

A special feature of the Softmax function is the requirement for a complete set of tensor values to compute the value of each element. Similarly, to compute the derivative of one element, we need a complete set of values for the function results. In our case, the results of the function are the normalized vector of dependency coefficients, which we obtained during the forward pass. We have also already obtained the vector of error gradients. Thus, we have all the necessary initial data to perform the operations of finding the derivative of a function and adjusting the error gradient. The formula for the derivative of the Softmax function is as follows:
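
$$\frac{\partial S_i}{\partial x_j} = S_i\,\left(\delta_{ij} - S_j\right)$$

Here S is the vector of Softmax outputs (the normalized dependency coefficients), x is its input, and δij is the Kronecker delta, equal to 1 when i = j and 0 otherwise. In matrix form this Jacobian equals diag(S) − Sᵀ·S, which is exactly the matrix that the code below assembles from the identity matrix and the score vector.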

The practical part of the error gradient adjustment operations is implemented using MQL5 matrix operations. After adjusting the error gradients, we divide the resulting vector by the square root of the dimension of the Key vector of one element of the sequence. We performed the same operation during the feed-forward pass to prevent uncontrolled growth of non-normalized dependency coefficients.

            //--- gradient distribution to Querys and Keys
            MATRIX score_grad = gradients[head].MatMul(values[head].Transpose());
            //---
            MATRIX ident = MATRIX::Identity(m_iUnits, m_iUnits);
            MATRIX ones = MATRIX::Ones(m_iUnits, 1);
            score = ones.MatMul(score);
            score = score.Transpose() * (ident - score);
            score_grad = score_grad.MatMul(score.Transpose()) /
                                                           sqrt(m_iKeysSize);
            MATRIX temp = score_grad.MatMul(keys[head]);
            if(!querys_grad.Row(temp.Row(0), head))
               return false;
            temp = querys[head] * score_grad[0, m_iCurrentPosition];
            if(!keys_grad.Row(temp.Row(0), head))
               return false;
           }

As a result of these operations, we obtain the adjusted error gradient on the dependency coefficient vector of the current element. But we will not save it to another data buffer. Instead, we will immediately distribute it to the corresponding elements of the Query and Key tensors. To do this, we need to multiply it by the matrix of the opposite tensor. To determine the error gradient on the Query vector, we have a complete set of sequence elements in the Key tensor. However, in the Query tensor, we only have one sequence element. Therefore, the error gradient on the Key tensor will be propagated only for the current element of the sequence. We save the obtained error gradient values into the matrices we prepared earlier.

Having obtained the error gradients at the level of the Query and Key tensors, we complete the operations of the loop through the attention heads.
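
For reference, the operations performed inside this loop are a compact form of the standard attention gradients, written for the single current query row. Here G is the error gradient at the attention output for the current element, S is the row of dependency coefficients, J is the Softmax Jacobian given above, d = m_iKeysSize, and the subscript cur marks the current position:

$$\nabla V_{cur} = G \cdot S_{cur}, \qquad \nabla S = \frac{(G \cdot V^{T})\,J^{T}}{\sqrt{d}}, \qquad \nabla Q = \nabla S \cdot K, \qquad \nabla K_{cur} = Q \cdot \nabla S_{cur}$$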

As soon as the full loop of iterations is completed, our querys_grad, keys_grad, and values_grad matrices will contain the accumulated error gradients for the current sequence element across all attention heads. All we have to do is transfer their values to the error gradient buffer of our internal Querys layer.

         if(!querys_grad.Reshape(1, m_iHeads * m_iKeysSize) ||
            !keys_grad.Reshape(1, m_iHeads * m_iKeysSize) ||
            !values_grad.Reshape(1, m_iHeads * m_iKeysSize))
            return false;
         if(!Querys.GetGradients().Row(querys_grad.Row(0), 0) ||
            !Querys.GetGradients().Row(keys_grad.Row(0), 1) ||
            !Querys.GetGradients().Row(values_grad.Row(0), 2))
            return false;
         if(!Querys.GetGradients().Reshape(1, Querys.GetGradients().Total()))
            return false;
        }
      else // OpenCL block
        {
         return false;
        }

This concludes the block that separates the algorithm by computational device. Next, we continue executing the algorithm using the methods of our internal neural layers.

Previously, we obtained a concatenated tensor of error gradients that includes data from all attention heads and from all three entities (Query, Key, Value). Now, using the hidden-layer gradient propagation method of our internal neural layer, Querys.CalcHiddenGradient, we can transfer the error gradient to the buffer of the previous layer. Before performing this operation, we need to decide into which object's buffer we will write the error gradients. We created this class as a multi-layer block, and all operations of the method are performed in a loop iterating through the nested layers of our block. Therefore, we transfer data to the previous neural layer, whose pointer we received in the method parameters, only from the first neural layer of our block, which has index 0 in the collection of nested neural layers of our GPT block. All other nested neural layers must pass the error gradient to the buffer of the internal neural layer FF2 of the previous nested layer. Let me remind you that FF2 is the internal neural layer holding the results of the Feed Forward block.

Therefore, we create a local pointer to the previous-layer object and assign it a pointer to the required object depending on the index of the active nested neural layer in our GPT block. Only after obtaining the pointer to the correct previous layer do we transfer the error gradient to its buffer.

      //--- transfer the error gradient to the previous layer
      CNeuronBase *prevL = (layer == 0 ? prevLayer : m_cFF2.At(layer - 1));
      if(!Querys.CalcHiddenGradient(prevL))
         return false;
      if(!prevL.GetGradients().SumArray(W0.GetGradients()))
         return false;
     }
//---
   return true;
  }

Please note that when constructing similar methods in the classes implementing attention mechanisms, at this point we built a complete procedure for summing error gradients from four directions. Now, thanks to the use of the concatenated error gradient buffer, we obtain the total error gradient from three directions by executing the method of only one neural layer. We still have to add gradients, but only once: to the obtained error gradient, we add the error gradient at the level of the outputs of the multi-head attention block. You will remember that during the feed-forward pass, we also added the original data to the output tensor of the multi-head attention block. Therefore, the error gradient must pass through all the steps that the signal goes through during the feed-forward pass, but in reverse order.
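
In other words, for a residual connection of the form u = x + f(x) (ignoring the normalization layer for simplicity), the chain rule gives

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial u} + \left(\frac{\partial f}{\partial x}\right)^{T}\frac{\partial L}{\partial u}$$

The first term corresponds to the gradient held in the W0 gradient buffer (the direct path of the skip connection), and the second to the gradient that Querys.CalcHiddenGradient writes to the previous layer's buffer, which is why the two are summed by the SumArray call above.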

This concludes the operations in the body of the loop iterating through the nested neural layers of our GPT block, as well as the overall operations of our method. We close the loop and exit the method.

And once again, I want to emphasize: do not forget to monitor every step of the operation execution. This helps minimize the risk of critical errors and makes the program operation more controlled and reliable.

We have discussed the organization of the error gradient propagation method to the previous layer. But this is only one of the three backpropagation methods that we must override for this class. Therefore, after propagating the error gradient to the previous neural layer, we need to propagate the error gradient to the internal weight matrices contained within the depths of a considerable number of internal objects of the neural layers. In accordance with the structure of our class methods, this functionality is performed in the CalcDeltaWeights method.

To propagate the error gradient to the weight matrix of any of the previously discussed neural layers, two things are necessary (they combine as shown in the formula after this list):

  • The error gradient at the output of the given neural layer, adjusted for the derivative of its activation function.
  • The initial data provided by the previous neural layer.
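
For a fully connected layer, for example, these two components combine into the familiar outer product (a general reminder only; the actual calculations are performed inside the called objects' own methods):

$$\frac{\partial L}{\partial W} = \delta^{T} \cdot x$$

where δ is the row of error gradients at the layer output, adjusted for the derivative of the activation function, and x is the row of initial data received from the previous layer.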

To organize this process, we already have all the necessary data. In the previous method, we distributed the error gradient to each neural layer. We will get a pointer to the previous neural layer in the parameters of the CNeuronGPT::CalcDeltaWeights method.

As usual, in the body of the method, we would organize a control block to check the pointers of all internal objects used. A control block should be minimal yet sufficient: eliminate redundant and explicitly repetitive checks, as they add no value to the program operation and can slow it down, since each operation, including a check, takes resources and time. Let's think about which objects contain the weight matrices we need to work with here. These include:

  • The Querys neural layer, which returns a concatenated tensor of the three entities (Query, Key, Value).
  • The W0 matrix neural layer.
  • Two neural layers of the Feed Forward block.

All the mentioned objects are declared static. Therefore, there is no need to check their pointers since their presence is controlled by the system. This allows us to exclude the control block from this method.

Everything else is straightforward. We organize a loop through all the nested neural layers of our GPT block. In the body of the loop, we extract the objects of the above collections one by one, check the pointers where necessary, and call each object's method to propagate the error gradient to the level of its weight matrix.

bool CNeuronGPT::CalcDeltaWeights(CNeuronBase *prevLayer, bool read)
  {
//--- in a loop, we call the method for each internal object
   for(int layer = 0; layer < m_iLayers; layer++)
     {
      if(!m_cFF2.At(layer))
         return false;
      CNeuronBase *temp = m_cFF2.At(layer);
      if(!temp.CalcDeltaWeights(m_cFF1.At(layer), false))
         return false;
      temp = m_cFF1.At(layer);
      if(!temp.CalcDeltaWeights(m_cW0.At(layer), false))
         return false;
      temp = m_cW0.At(layer);
      if(!temp.CalcDeltaWeights(m_cAttentionOut.At(layer), false))
         return false;
      temp = m_cQuerys.At(layer);
      if(!temp)
         return false;
      CNeuronBase *prevL = (layer == 0 ? prevLayer : m_cFF2.At(layer - 1));
      if(!temp.CalcDeltaWeights(prevL, (read && layer == m_iLayers - 1)))
         return false;
     }
//---
   return true;
  }

It is worth saying a few words about the order in which the methods of the internal objects are called. From the perspective of the mathematical operations, the order of the calls does not affect the final result. However, the order used in the loop body is not random. Note that in the loop body we explicitly check the pointers of only two objects, the ones that are not passed as input data to other internal layers. The reason is that the called neural layer methods also contain a control block that checks the incoming data, including the received pointers. To avoid checking the same pointers twice, we first pass a pointer to an object as input to another object and check the result of the called method, which, among other things, confirms the validity of the passed pointer. After that, we can confidently access the object, because its pointer has already been verified during the execution of the previous object's method. In this way, we obtain a comprehensive check of all object pointers without explicit controls in the method body and eliminate redundant pointer checks that could slow down program execution.

Next, we will consider the method for updating the model parameters. This method does not require data from external objects: there is not a single object pointer among its parameters, only the values needed to execute the selected parameter optimization algorithm.

In the method body, we also organize a loop to iterate through the nested neural layers of our GPT block. In the loop body, we extract one object from each collection, check the validity of the pointer, and call the method to update the weight matrix of each object.

bool CNeuronGPT::UpdateWeights(int batch_size, TYPE learningRate,
                               VECTOR &Beta, VECTOR &Lambda)
  {
//--- in a loop we call the method for each internal object
   for(int layer = 0; layer < m_iLayers; layer++)
     {
      CNeuronBase *temp = m_cFF2.At(layer);
      if(!temp || !temp.UpdateWeights(batch_size, learningRate, Beta, Lambda))
         return false;
      temp = m_cFF1.At(layer);
      if(!temp || !temp.UpdateWeights(batch_size, learningRate, Beta, Lambda))
         return false;
      temp = m_cW0.At(layer);
      if(!temp || !temp.UpdateWeights(batch_size, learningRate, Beta, Lambda))
         return false;
      temp = m_cQuerys.At(layer);
      if(!temp || !temp.UpdateWeights(batch_size, learningRate, Beta, Lambda))
         return false;
         return false;
     }
//---
   return true;
  }

Since the called methods do not access external objects, our control optimization approach will not work here due to the absence of explicitly repetitive controls. Therefore, we need to explicitly check each object pointer before calling its method.

We have discussed the implementation of the three backpropagation methods, and with that, we conclude our work on implementing the GPT model algorithm in our CNeuronGPT class. However, for the complete implementation of the functionality using standard MQL5 tools, we still need to override the methods for working with files. We have already discussed the importance of these methods for the operation of neural network models.