Building Self-Attention with MQL5 tools

At first acquaintance, the presented Self-Attention architecture may seem rather difficult to understand and to implement. But let's not be pessimistic. We will break the whole algorithm down into small components, and as we implement each individual block, the overall picture will come together and no longer look so complex. Along the way, you will see how neatly the pieces combine into a functional mechanism for our library.

Now let's get to work. To implement our Self-Attention layer, let's create a new CNeuronAttention class. As always, we will inherit from our base class of the neural layer, CNeuronBase.

class CNeuronAttention    :  public CNeuronBase
  {
public:
                     CNeuronAttention(void);
                    ~CNeuronAttention(void);
   //---
   virtual bool      Init(const CLayerDescription *desc) override;
   virtual bool      SetOpenCL(CMyOpenCL *opencl) override;
   virtual bool      FeedForward(CNeuronBase *prevLayer) override;
   virtual bool      CalcHiddenGradient(CNeuronBase *prevLayer) override;
   virtual bool      CalcDeltaWeights(CNeuronBase *prevLayer) override;
   virtual bool      UpdateWeights(int batch_size, TYPE learningRate,
                                   VECTOR &Beta, VECTOR &Lambda) override;
   //--- methods of working with files
   virtual bool      Save(const int file_handle) override;
   virtual bool      Load(const int file_handle) override;
   //--- object identification method
   virtual int       Type(void) override const { return(defNeuronAttention); }
  };

Let's consider the first step of the Self-Attention algorithm: the computation of the Query, Key, and Value vectors. At the input, we receive a tensor of raw data containing the features of each bar of the analyzed sequence. We take the features of the first candlestick and, by multiplying them by a weight matrix, obtain a vector. Then we take the features of the second candlestick and multiply them by the same weight matrix, obtaining a second vector of the same size as the first. Doesn't this resemble the convolutional layer we created earlier, where the length of the result vector equals the number of filters used? Hence, to organize the process described above, we declare three nested convolutional layers of the CNeuronConv class. We give them appropriate names to make the code easier to read.
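
Before moving on to the class members, here is a minimal sketch, with hypothetical names and sizes, of what this projection looks like when expressed directly through MQL5 matrix operations. In the class itself this work is delegated to the convolutional layers, but the arithmetic is the same: one shared weight matrix applied row by row.

// a sketch with hypothetical names and sizes: every row of "inputs" holds the
// features of one bar, and the same weight matrix "w_query" is applied to each
// row, producing one Query vector per bar
MATRIX inputs(20, 5);                    // 20 bars, 5 features per bar
MATRIX w_query(5, 8);                    // shared projection weights, Query/Key size 8
MATRIX querys = inputs.MatMul(w_query);  // 20 Query vectors of length 8 each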

class CNeuronAttention    :  public CNeuronBase
  {
protected:
   CNeuronConv       m_cQuerys;
   CNeuronConv       m_cKeys;
   CNeuronConv       m_cValues;
   .....
  };

According to this algorithm, in the next step, we determine the Score matrix by multiplying the Query and Key matrices. To write the matrix data, we will create a data buffer as an object of the CBufferType class.

class CNeuronAttention    :  public CNeuronBase
  {
protected:
   .....
   CBufferType       m_cScores;
   .....
  };

After determining the Score matrix of dependency coefficients, we need to find the weighted values. To do this, we multiply the Value vectors by the corresponding coefficients of the Score matrix. After additional processing, we obtain a tensor equal in size to the initial data; we will discuss the reasons for this equality during the implementation. For now, let's simply note that we need storage for these intermediate results. Filling this storage requires a process of its own, so we need an object that gives us convenient write access, and later we plan to pass its contents as input to internal neural layers. The most suitable template is therefore the raw data neural layer: a basic neural layer with a zero input window, the same as we use for the input data layer.
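
To make the intermediate steps more tangible, here is a minimal sketch, under the assumption that the matrices querys, keys, and values each hold one row per sequence element, of how the dependency coefficients and the weighted values can be computed with MQL5 matrix operations. The function name AttentionOutput is hypothetical; the class performs these steps inside its feed-forward pass, which we will implement later.

// a sketch of one attention head: Score = Softmax(Q * K' / sqrt(key size)),
// output = Score * V; every matrix holds one row per sequence element
MATRIX AttentionOutput(MATRIX &querys, MATRIX &keys, MATRIX &values)
  {
   MATRIX keys_t = keys.Transpose();
   MATRIX scores = querys.MatMul(keys_t) / MathSqrt((TYPE)keys.Cols());
   //--- row-wise Softmax: every row of coefficients sums to 1
   for(ulong r = 0; r < scores.Rows(); r++)
     {
      VECTOR row = scores.Row(r);
      row -= row.Max();              // shift for numerical stability
      row = MathExp(row);
      row /= row.Sum();
      scores.Row(row, r);
     }
   //--- each output row is a Score-weighted sum of the Value vectors
   return scores.MatMul(values);
  }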

class CNeuronAttention    :  public CNeuronBase
  {
protected:
   .....
   CNeuronBase       m_cAttentionOut;
   .....
  };

Here, it is important to note the difference between the output of the Self-Attention algorithm and the output of the entire CNeuronAttention class. The first is obtained by weighting the Value vectors during the execution of the Self-Attention algorithm; we save it in the m_cAttentionOut instance of the basic neural layer. The second is obtained after processing in the Feed Forward block and is saved to the result buffer of our class.

So, next, we need to organize the Feed Forward block. We will create it from two consecutive convolution layers. It may seem unusual to use a convolutional layer when the solution architecture is described as having fully connected layers. The situation here is similar to the first point of the algorithm when we determined the value of Query, Key, and Value vectors. Looking at the block within the context of one element of the sequence, we can see two fully connected neural layers. However, when considering the entire time series, it becomes evident that the same weight matrix is applied sequentially to each element of the sequence. Furthermore, as the input data progresses sequentially, the results are laid out in the same order. Doesn't this resemble the operation of a convolutional layer? We just need to take the convolution layer and set the width of the source data window equal to the vector size of one sequence element. The step of the initial data window is set equal to the window width, and the number of filters used is determined by the size of the fully connected layer for one element of the sequence.
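
As a quick illustration of this mapping (the numbers here are purely hypothetical and not part of the class), a per-element fully connected layer of 20 neurons applied to a sequence whose elements are described by 5 features each could be described for the convolutional layer as follows:

// hypothetical description of a "fully connected per element" layer
CLayerDescription *ff = new CLayerDescription();
ff.type       = defNeuronConv;
ff.window     = 5;       // input window = size of one element vector
ff.step       = 5;       // stride equal to the window: elements do not overlap
ff.window_out = 20;      // number of filters = neurons per element
ff.count      = 100;     // one convolution step per sequence element (100 bars)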

Thus, we add two convolution layers to organize the Feed Forward block.

class CNeuronAttention    :  public CNeuronBase
  {
protected:
   .....
   CNeuronConv       m_cFF1;
   CNeuronConv       m_cFF2;
   .....
  };

We identified the objects we need to organize the Self-Attention mechanism in our class. To complete the picture, let's add a few more variables:

  • m_iWindow — width of the initial data window (size of one sequence element vector)
  • m_iUnits — the number of units in the sequence
  • m_iKeysSize — size of the result vector for the Query and Key tensors
  • m_cStd — a buffer in which, during layer normalization, we store the standard deviation by which the values are divided; we will need it later to determine the derivative

Taking into account the standard set of functions for overriding, the class structure will have the following form.

class CNeuronAttention    :  public CNeuronBase
  {
protected:
   CNeuronConv       m_cQuerys;
   CNeuronConv       m_cKeys;
   CNeuronConv       m_cValues;
   CBufferType       m_cScores;
   int               m_cScoreGrad;
   int               m_cScoreTemp;
   CNeuronBase       m_cAttentionOut;
   CNeuronConv       m_cFF1;
   CNeuronConv       m_cFF2;
   //---
   int               m_iWindow;
   int               m_iUnits;
   int               m_iKeysSize;
   CBufferType       m_cStd;

public:
                     CNeuronAttention(void);
                    ~CNeuronAttention(void);
   //---
   virtual bool      Init(const CLayerDescription *desc) override;
   virtual bool      SetOpenCL(CMyOpenCL *opencl) override;
   virtual bool      FeedForward(CNeuronBase *prevLayer) override;
   virtual bool      CalcHiddenGradient(CNeuronBase *prevLayer) override;
   virtual bool      CalcDeltaWeights(CNeuronBase *prevLayer) override;
   virtual bool      UpdateWeights(int batch_size, TYPE learningRate,
                                   VECTOR &Beta, VECTOR &Lambda) override;
   //--- methods for operations with files
   virtual bool      Save(const int file_handle) override;
   virtual bool      Load(const int file_handle) override;
   //--- object identification method
   virtual int       Type(void) override const { return(defNeuronAttention); }
  };

In the class constructor, we only set the initial values of the variables.

Please note that in this class we use static objects rather than pointers to objects, as we did previously. The lifetime of static objects, like that of variables, is equal to the lifetime of the object containing them. By using such objects, we avoid the need to create object instances during class initialization and to clean up memory when the class finishes its work. We also no longer need to check the validity of the object pointer every time, which saves some time in each method. However, in this case we cannot replace an object by simply copying its pointer, a property that is actively used in our activation class and in recurrent networks (where the same object pointers are reused when analyzing the entire depth of the history).

CNeuronAttention::CNeuronAttention(void) :   m_iWindow(1),
                                             m_iUnits(0),
                                             m_iKeysSize(1)
  {
   m_cStd.BufferInit(1, 2, 1);
  }

Since we use static objects, we leave the class destructor empty.

CNeuronAttention::~CNeuronAttention(void)
  {
  }

 

Method of class initialization

After creating the class constructor and destructor, we move on to overriding the main methods of the class. First, we will override the class initialization method CNeuronAttention::Init. The main task of this method is to prepare the class to perform its functionality with user-defined parameters. Like similar methods in other previously discussed classes, the method receives an instance of the CLayerDescription object as a parameter, in which the parameters of the initialized neural layer are specified. Therefore, in order to eliminate possible errors in future work, we organize the block of initial data verification. In this method, we will check for the presence of the minimum required parameters in the received data.

bool CNeuronAttention::Init(const CLayerDescription *desc)
  {
//--- check the initial data
   if(!desc || desc.type != Type() || desc.count <= 0 ||
       desc.window <= 0 || desc.window_out <= 0)
      return false;

After that, we will save the main parameters into specially prepared variables. Note the correspondence between the parameters of the neural layer description class and their functional purpose:

  • CLayerDescription.window — the size of the source data window, i.e., the length of the vector describing one element of the sequence (in our case, one bar)
  • CLayerDescription.count — the number of elements in the sequence (the number of analyzed bars)
  • CLayerDescription.window_out — the size of the result vector for the Query and Key tensors

   m_iWindow   = desc.window;
   m_iUnits    = desc.count;
   m_iKeysSize = desc.window_out;
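
For reference, a hypothetical description that an external program might pass to create this layer could look as follows; the specific values are purely illustrative.

// a hypothetical description of the attention layer
CLayerDescription *desc = new CLayerDescription();
desc.type       = defNeuronAttention;  // layer type checked at the start of Init
desc.count      = 40;                  // number of sequence elements (analyzed bars)
desc.window     = 8;                   // size of the vector describing one bar
desc.window_out = 16;                  // size of the Query and Key result vectors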

As before, we start initializing the object by calling the analogous initialization method of the parent class. But there is a nuance here: we cannot simply pass on the received description of the neural layer. Instead, we will create a new instance of the CLayerDescription neural layer description object and enter the corrected data into it.

In the count field, we specify the total number of elements at the output of the layer, which is obtained by multiplying the count and window fields of the received description.

Note that to obtain the total number of elements in the output of the neural layer, we multiply the number of elements in the sequence (number of bars analyzed) by the size of the source window (elements describing 1 bar), not the size of the results window. The reason is that we will use the results window size only for Query and Key tensors. The size of the result vector for the Value tensors and the second layer of the Feed Forward block will be equal to the size of the initial data window. This is done to align the dimensionality of the initial data and the results. The algorithm involves adding the tensors of the original data to the results of the Self-Attention block and then adding the tensors of the results of the Feed Forward and Self-Attention blocks, as well. Thus, as a result of tensor addition, the sequence at the output of our neural layer cannot be shorter than the initial data. And it doesn't make any sense to increase it. Therefore, we align the dimensions of the vectors.
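
Schematically, the two additions that dictate this alignment can be written as follows (a sketch in matrix form with hypothetical sizes; the normalization between the steps is omitted here):

MATRIX inputs(40, 8);                      // [units x window] initial data
MATRIX attention(40, 8);                   // Self-Attention block result, same size
MATRIX normalized = inputs + attention;    // first residual connection
// ... the Feed Forward block processes "normalized" ...
MATRIX ff_out(40, 8);                      // Feed Forward block result, same size again
MATRIX output = normalized + ff_out;       // second residual connection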

In addition to changing the number of elements, we will also change the size of the output window, setting it to one. The size of the initial data window will be equal to zero. After that, we will call the parent class initialization method.

//--- calling the parent class initialization method
   CLayerDescription *temp = new CLayerDescription();
   if(!temp)
      return false;
   temp.count = desc.count * desc.window;
   temp.window_out = 1;
   temp.window     = 0;
   temp.optimization = desc.optimization;
   temp.activation = desc.activation;
   temp.activation_params = desc.activation_params;
   temp.type = desc.type;
   if(!CNeuronBase::Init(temp))
     {
      delete temp;
      return false;
     }

Such a parameter substitution allows the parent class initialization method to run in the mode of a source data neural layer. At the same time, no additional buffers will be created for the weight matrix or for the corresponding optimization method. As with the LSTM block, this neural layer will not have a separate weight matrix: all weights will be stored in the inner neural layers.

We specify a similar architecture for the inner m_cAttentionOut layer that collects the output of the attention block. We simply change the type of the neural layer and explicitly disable the activation function.

//--- initialize AttentionOut
   temp.type = defNeuronBase;
   temp.activation = AF_NONE;
   if(!m_cAttentionOut.Init(temp))
     {
      delete temp;
      return false;
     }

Next, to initialize our internal neural layers, we need to create a description for them. We fill the previously created instance of the CLayerDescription class with the necessary data. Almost all of our internal neural layers are convolutional, so in the type parameter, we will specify defNeuronConv. The rest of the parameters are transferred without changes from the obtained external description.

//--- create a description for the internal neural layers
   temp.type = defNeuronConv;
   temp.window = desc.window;
   temp.window_out = m_iKeysSize;
   temp.step = desc.window;
   temp.count = desc.count;

Next, we proceed to initialize the internal neural layers. We first initialize the convolution layer to define Query vectors using a pre-built description. Don't forget to check the results of the operations.

//--- initialize Querys
   if(!m_cQuerys.Init(temp) || !m_cQuerys.SetTransposedOutput(true))
     {
      delete temp;
      return false;
     }

Note that we use the new CNeuronConv::SetTransposedOutput method after initializing the convolutional neural layer. The reasons for its appearance and its functionality will be discussed a bit later.

We initialize the Keys layer using a similar algorithm.

//--- initializing Keys
   if(!m_cKeys.Init(temp) || !m_cKeys.SetTransposedOutput(true))
     {
      delete temp;
      return false;
     }

Next, initialize the Values layer. We use the above algorithm with a small addition. As mentioned earlier, when initializing this object, the result window is set equal to the input data window. Therefore, we make changes to the neural layer description object and call the initialization method. Let's check the result of the operations.

//--- initialize Values
   temp.window_out = m_iWindow;
   if(!m_cValues.Init(temp) || !m_cValues.SetTransposedOutput(true))
     {
      delete temp;
      return false;
     }

Next, we initialize the Scores coefficient matrix. According to the Self-Attention mechanism algorithm, this is a square matrix with a side length equal to the number of elements in the sequence. For us, it is the number of bars analyzed.

In the discussion of this algorithm, it is important to understand the difference between the number of elements in the sequence and the total number of elements at the output of the neural layer. If we translate this into the analysis of the candlestick chart of a financial instrument, then:

  • The number of elements in a sequence is the number of bars to be analyzed.
  • The length of a vector of one sequence element (input / output window) is the number of elements describing 1 bar.
  • The total number of elements at the input/output of the neural layer is the product of the first two quantities.
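
For example, when analyzing 20 bars described by 5 features each, the sequence contains 20 elements, the vector of one element has a length of 5, the input and output of the neural layer each contain 20 * 5 = 100 values, while the Score matrix has a size of 20 × 20 = 400 elements.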

Let's return to the initialization of the coefficient matrix buffer. For it, we have declared a data buffer. We will initialize it with zero values by setting the buffer size as a square matrix.

//--- initialize Scores
   if(!m_cScores.BufferInit(temp.count, temp.count, 0))
     {
      delete temp;
      return false;
     }

Next in the Self-Attention algorithm comes the basic neural layer object for recording the attention results, which we have already initialized above.

All we have to do now is initialize the Feed Forward block. As mentioned, it will consist of two convolutional neural layers. According to the architecture proposed by the authors, the result tensor of the first neural layer is four times larger than the input data. In addition, the authors used the ReLU activation function in the first neural layer; we will replace it with Swish. We make the specified changes to the description of the neural layer and proceed with its initialization.
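
Recall that Swish is defined as Swish(x) = x · sigmoid(β·x). The first activation parameter set below presumably corresponds to β = 1, at which Swish becomes x · sigmoid(x), also known as SiLU.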

//--- initialize FF1
   temp.window_out *= 4;
   temp.activation = AF_SWISH;
   temp.activation_params[0] = 1;
   temp.activation_params[1] = 0;
   if(!m_cFF1.Init(temp) || !m_cFF1.SetTransposedOutput(true))
     {
      delete temp;
      return false;
     }

To initialize the second neural layer in the Feed Forward block, we need to increase the size of the input data window and its stride. The size of the results window should be resized to match the size of the attention block results tensor. It will also correspond to the tensor size of the previous layer.

For the second neural layer in the Feed Forward, we will use the activation function specified by the user during the initialization of our class.

After making the necessary changes to the description object of the neural layer, we will use the algorithm discussed earlier to initialize the last internal neural layer.

//--- initialize FF2
   temp.window = temp.window_out;
   temp.window_out = temp.step;
   temp.step = temp.window;
   temp.activation = desc.activation;
   temp.activation_params = desc.activation_params;
   if(!m_cFF2.Init(temp) || !m_cFF2.SetTransposedOutput(true))
     {
      delete temp;
      return false;
     }
   delete temp;

After initializing all internal neural layers, we delete the temporary neural layer description object. We don't need it anymore.

Now let's use a little trick. After the forward pass, the result of the Feed Forward block ends up in the result buffer of its second convolutional layer. To pass the data to the subsequent neural layer, we would have to copy it into the result buffer of our class, which would cost additional time and resources at every iteration. To avoid this, we can substitute the pointers to the objects. Remember how we discussed objects and the pointers to them?

First, we delete the result buffer object of our class so as not to leave unreferenced objects in memory. Then, to the variable that stores the pointer to the buffer object, we assign the pointer to the analogous buffer of the second neural layer of the Feed Forward block. We perform the same operation for the gradient buffer.

//--- to avoid copying the buffers we swap them
   if(m_cOutputs)
      delete m_cOutputs;
   m_cOutputs = m_cFF2.GetOutputs();
   if(m_cGradients)
      delete m_cGradients;
   m_cGradients = m_cFF2.GetGradients();

Thanks to this simple trick, we have been able to avoid constant data copying between buffers and reduce the time required to perform operations within the class.

At the end of the initialization method, we call the SetOpenCL method to ensure that all internal objects work in the same context, and then exit the method with a positive result.

//--- pass the pointer to the OpenCL context object to all internal objects
   SetOpenCL(m_cOpenCL);
//---
   return true;
  }

The SetOpenCL method, called at the end of the initialization method, is designed to distribute the pointer to the OpenCL context work object among all internal objects. This is necessary to ensure that all objects work in the same space. This method was created as virtual in the base class of the neural layer. It is redefined in each new class as needed.

The algorithm of the method is quite simple, and we have already seen it in all the previous classes. In its parameters, the method receives from the external program a pointer to the object for working with the OpenCL context. We start by calling the parent class method and passing it the received pointer. The validation of the pointer is already implemented in the parent class method, so there is no need to repeat it here.

Then we pass the pointer to the OpenCL context to all the internal objects of our class. The point is that the parent class method validates the received pointer and stores an appropriate pointer in the m_cOpenCL variable. To ensure that all objects work in the same context, we propagate this already verified pointer.

bool CNeuronAttention::SetOpenCL(CMyOpenCL *opencl)
  {
   CNeuronBase::SetOpenCL(opencl);
   m_cQuerys.SetOpenCL(m_cOpenCL);
   m_cKeys.SetOpenCL(m_cOpenCL);
   m_cValues.SetOpenCL(m_cOpenCL);
   m_cAttentionOut.SetOpenCL(m_cOpenCL);
   m_cFF1.SetOpenCL(m_cOpenCL);
   m_cFF2.SetOpenCL(m_cOpenCL);
   if(m_cOpenCL)
     {
      m_cScores.BufferCreate(m_cOpenCL);
      ulong size = sizeof(TYPE) * m_cScores.Total();
      m_cScoreGrad = m_cOpenCL.AddBuffer((uint)size, CL_MEM_READ_WRITE);
      m_cScoreTemp = m_cOpenCL.AddBuffer((uint)size, CL_MEM_READ_WRITE);
      m_cStd.BufferCreate(m_cOpenCL);
     }
   else
     {
      m_cScores.BufferFree();
      m_cStd.BufferFree();
     }
//---
   return(!!m_cOpenCL);
  }

Looking a bit ahead, I want to draw attention to the creation of the m_cScoreGrad and m_cScoreTemp buffers. They are used only in the OpenCL context for temporary data storage, so we did not create mirror objects for them in the main memory, and we will not use them to exchange data between the main program and the OpenCL context. In this case, we create the buffers in the OpenCL context, while on the side of the main program we keep only their handles. When multi-threading technology is disabled, we immediately free the mentioned buffers.

After completing the initialization method of the class, we can proceed to override the functional methods of our class.