5. Creating a new neural layer class

Let's get to the practical part and look at the implementation of our multi-head attention neural layer. To implement it, we create a new MHAttention class that inherits from tf.keras.layers.Layer, the base class of all neural layers.

# Multi-Head Self-Attention Model
class MHAttention(tf.keras.layers.Layer):

First, we'll override the layer initialization method __init__. In the parameters of the initialization method, we will specify two constants:

  • key_size — size of the vector describing one element of the sequence in the tensor of Keys
  • heads — number of attention heads

In the body of the method, we will save the parameters in class variables for future use and immediately calculate the size of the concatenated output of the attention heads, storing it in the m_iDimension variable.

For your convenience, I have tried to keep the variable names as close as possible to those used in the MQL5 implementation.

Next, we declare the internal objects of our neural layer. However, note that in this case, we do not specify the vector size of one element of the source data sequence. This is made possible by the use of multidimensional tensors.

The TensorFlow library works with multidimensional arrays, or tensors, represented as objects. This approach makes the model easier to understand and visualize. To implement the task in OpenCL, we were forced to use one-dimensional data buffers and calculate the offset in the buffer to reach the required element. Now, when using multidimensional arrays, to access a matrix element we just need to specify its row and column. It is convenient and clear.
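
As a brief illustration of the difference, here is a minimal sketch with arbitrary sizes (100 sequence elements, 8 values per element); the variable names are purely illustrative:

import tensorflow as tf

elements, window = 100, 8
flat = tf.range(elements * window, dtype=tf.float32)   # one-dimensional buffer, as in OpenCL
matrix = tf.reshape(flat, (elements, window))          # two-dimensional tensor

i, j = 5, 3                                            # element index and position within its vector
from_buffer = flat[i * window + j]                     # manual offset calculation
from_matrix = matrix[i, j]                             # row/column indexing, same value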

Another advantage of this approach is that we do not need to specify the dimension of the source data: we can get it from the tensor itself. We will take advantage of this. We won't ask the user for the size of the description vector of one element of the input data sequence. Instead, we will receive the input data tensor as a matrix in which each row is the description vector of one element of the sequence, and we can operate with the size of this vector. That is, the first dimension indicates the number of elements in the sequence, and the second is the length of the description vector of one element.

However, there is another side to the coin. At the time of class initialization, we have not yet received the input data, so we do not know its size, as the user does not specify it in the parameters. Therefore, we cannot create all the objects in the initialization method. But that is not a problem: we will do what we can.

In the initialization method, we will declare objects that can be created without understanding the dimension of the source data:

  • m_cQuerys — neural layer for the formation of the concatenated tensor of queries Query
  • m_cKeys — neural layer for the formation of the concatenated tensor of keys Key
  • m_cValues — neural layer for the formation of the concatenated tensor of values Values
  • m_cNormAttention — data normalization layer for the Multi-Head Self-Attention block
  • m_cNormOutput — normalization layer for the results of the neural layer

  def __init__(self, key_size, heads, **kwargs):
    super(MHAttention, self).__init__(**kwargs)
 
    self.m_iHeads = heads
    self.m_iKeysSize = key_size
    self.m_iDimension=self.m_iHeads*self.m_iKeysSize
 
    self.m_cQuerys = tf.keras.layers.Dense(self.m_iDimension)
    self.m_cKeys = tf.keras.layers.Dense(self.m_iDimension)
    self.m_cValues = tf.keras.layers.Dense(self.m_iDimension)
    self.m_cNormAttention=tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.m_cNormOutput=tf.keras.layers.LayerNormalization(epsilon=1e-6)

After creating the initialization method, we proceed to the build method, which allows us to initialize the missing objects. This method is run only once, before the first call of the call method, and it receives the shape of the source data in its parameter. Knowing this shape, we can initialize the objects, structures, and/or parameters that depend on the size of the source data.

In the method body, we save the last dimension of the source data tensor, which is the size of the description vector of one element of the sequence, in the m_iWindow variable. After that, we create three more internal neural layers:

  • m_cW0 — fully connected layer of the reduction matrix W0
  • m_cFF1 — the first fully connected layer of the Feed Forward block
  • m_cFF2 — the second fully connected layer of the Feed Forward block

  def build(self, input_shape):
    self.m_iWindow=input_shape[-1]
    self.m_cW0 = tf.keras.layers.Dense(self.m_iWindow)
    self.m_cFF1=tf.keras.layers.Dense(4*self.m_iWindow,
                                      activation=tf.nn.swish)
    self.m_cFF2=tf.keras.layers.Dense(self.m_iWindow)

So, we have defined all the internal objects necessary to implement the Multi-Head Self-Attention algorithm inside our new layer. Before proceeding with the implementation, let's look once again at how the multi-head attention algorithm can be written in terms of matrix mathematics, since working with multidimensional tensors means operating with matrix operations.

The first step is to define the Query, Key, and Value tensors. To obtain them, we multiply the source data tensor by the corresponding weight matrices. This operation is performed by the three internal neural layers declared earlier.

  def call(self, data):
    batch_size = tf.shape(data)[0]
    query = self.m_cQuerys(data)
    key = self.m_cKeys(data)
    value = self.m_cValues(data)

The second step is to determine the matrix of dependency coefficients. According to the Self-Attention algorithm, we first need to multiply the query tensor by the transposed key tensor.

Everything is simple for just one attention head. But we have concatenated tensors, which in the last dimension contain the data of all attention heads. Multiplying them in this form would give a result comparable to single-headed attention. As an option, we can transform the two-dimensional tensor into a three-dimensional one, separating the attention heads into a distinct dimension.

Multiplying along the last two dimensions in this form is still not quite what we want. However, if we swap the sequence and head dimensions, then multiplying along the last two dimensions gives exactly the result we are looking for: a separate matrix for each attention head.
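
To make the transformation more tangible, here is a hypothetical shape trace; the sequence length of 100, 8 heads, and a key size of 64 are chosen purely for illustration:

x = tf.zeros((1, 100, 8 * 64))             # concatenated tensor: (batch, sequence, heads * key_size)
x = tf.reshape(x, (1, -1, 8, 64))          # (1, 100, 8, 64): heads separated into their own dimension
x = tf.transpose(x, perm=[0, 2, 1, 3])     # (1, 8, 100, 64): sequence and head dimensions swapped
score = tf.matmul(x, x, transpose_b=True)  # (1, 8, 100, 100): one score matrix per attention head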

The described procedure will be placed in a separate function split_heads.

  def split_heads(self, x, batch_size):
    x = tf.reshape(x, (batch_size, -1,
                                self.m_iHeads, 
                                self.m_iKeysSize))
    return tf.transpose(x, perm=[0, 2, 1, 3])

Inside the call method, we transform tensors and multiply them according to the Self-Attention algorithm.

    query = self.split_heads(query, batch_size)
    key = self.split_heads(key, batch_size)
    value = self.split_heads(value, batch_size) 
    score = tf.matmul(query, key, transpose_b=True)

Next, we need to divide the obtained dependence coefficients by the square root of the dimension of the key vector and then normalize them with the Softmax function along the last dimension of the tensor.

    score = score / tf.math.sqrt(tf.cast(self.m_iKeysSize, tf.float32))
    score = tf.nn.softmax(score, axis=-1)

Now we only need to multiply the normalized dependency coefficients by the Value tensor.

    attention = tf.matmul(score, value)

As a result of this operation, we get the attention block output for each attention head. To continue the algorithm, we need a concatenated tensor of all attention heads. Therefore, we carry out the reverse tensor transformation: once again we swap the sequence and head dimensions and then reshape the tensor from three dimensions back to two.

    attention = tf.transpose(attention, perm=[0, 2, 1, 3])
    attention = tf.reshape(attention,(batch_size, -1, self.m_iDimension))

After that, using the W0 matrix, we convert the concatenated tensor of results back to the size of the initial data tensor. Then we add the two tensors and normalize the result.

    attention = self.m_cW0(attention)
    attention=self.m_cNormAttention(data + attention)

This concludes the first block of the Multi-Head Self-Attention algorithm, followed by two consecutive fully connected layers of the Feed Forward block. The first neural layer will be with the Swish activation function, and the second one will have no activation function.

    output=self.m_cFF1(attention)
    output=self.m_cFF2(output)

At the end of the method, we add the result tensors of the Multi-Head Self-Attention and Feed Forward blocks and normalize the sum. The result of the operations is returned as a tensor.

    output=self.m_cNormOutput(attention+output)
    return output
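
Before moving on, we can run a quick sanity check to confirm that the layer preserves the shape of its input; the sizes below are arbitrary:

layer = MHAttention(key_size=64, heads=8)
inputs = tf.random.normal((32, 100, 40))   # batch of 32, 100 sequence elements, 40 features each
outputs = layer(inputs)                    # build is triggered here on the first call
print(outputs.shape)                       # (32, 100, 40), the same shape as the input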

We have implemented a minimal set of class methods, sufficient to test its functionality. However, in this form we will not be able to save a model that uses this class. That is a problem, because our goal is to build and train a model for subsequent practical use, so the ability to save the model and then restore it is one of the key requirements.

First, to enable saving of the new object, which is our neural layer, we need to add it to the list of custom objects and provide serialization capabilities for it. This is done with the register_keras_serializable decorator, which we add before the declaration of our neural layer class.

# Multi-Head Self-Attention model
@tf.keras.utils.register_keras_serializable(package="Custom", name='MHAttention')
class MHAttention(tf.keras.layers.Layer):

But that's not all. We still need to add the get_config method, which returns the contents of the variables to be saved to file. Note that these variables include both the values specified by the user when initializing the class object and those derived from the size of the source data: the layer's weights are tuned to these dimensions.

  def get_config(self):
    config={'key_size': self.m_iKeysSize,
            'heads': self.m_iHeads,
            'dimension': self.m_iDimension,
            'window': self.m_iWindow
            }
    base_config = super(MHAttention, self).get_config()
    return dict(list(base_config.items()) + list(config.items()))

The from_config method is responsible for restoring the object from the configuration dictionary. Please note the following, however. Normally, the configuration dictionary contains only the parameters of the class initialization method. But we also saved values that depend on the size of the source data, and, as you remember, these are not among the parameters of the initialization method. If we pass the dictionary as is, we will get an error about unknown parameters. Therefore, at the beginning of the method, we remove these entries from the configuration dictionary while saving their values in local variables. Only after that do we restore the layer.

  @classmethod
  def from_config(cls, config):
    dimension=config.pop('dimension')
    window=config.pop('window')
    layer = cls(**config)
    layer._build_from_signature(dimension, window)
    return layer             

After initializing our neural layer from the configuration dictionary, we need to pass the previously extracted values describing the input data configuration to the corresponding variables. To do this, we call the _build_from_signature method, which we also need to define.

  def _build_from_signature(self, dimension, window):
    self.m_iDimension=dimension
    self.m_iWindow=window           
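
As a quick illustration of the serialization machinery, here is a minimal sketch; the model structure and the file name are purely illustrative and serve only to check that a model containing the layer can be saved and restored:

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(100, 40)),
    MHAttention(key_size=64, heads=8),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1)
    ])
model.compile(optimizer='Adam', loss='mean_squared_error')
model.save('mha_test.h5')                                # the layer configuration is written via get_config

restored = tf.keras.models.load_model('mha_test.h5')     # the layer is recreated via from_config
restored.summary()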

With that, we conclude our work on the class of our neural layer and can move on to creating a model to test the newly created Multi-Head Self-Attention neural layer.