Self-Attention
The models described above rely on recurrent blocks, which are costly to train. In June 2017, in the paper "Attention Is All You Need", the authors proposed a new neural network architecture called the Transformer, which eliminates recurrent blocks entirely in favor of a Self-Attention mechanism. In contrast to the attention mechanism described above, Self-Attention computes pairwise dependencies between elements of the same sequence. The Transformer outperformed recurrent models on machine translation benchmarks, and today the architecture and its derivatives underlie many models, including GPT-2 and GPT-3. Let us consider the Self-Attention algorithm in more detail.
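Before going into the details, here is a minimal sketch of how pairwise dependencies within a single sequence can be computed with scaled dot-product self-attention, as described in the paper. The function name, the NumPy implementation, and the toy dimensions are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a single sequence (illustrative sketch).

    x:             (seq_len, d_model) input token embeddings
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices
    """
    q = x @ w_q                                  # queries
    k = x @ w_k                                  # keys
    v = x @ w_v                                  # values
    # pairwise similarity of every token with every other token in the same sequence
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # softmax over each row turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output vector is a weighted mix of all value vectors
    return weights @ v

# toy usage (assumed sizes): 4 tokens, model dimension 8, attention dimension 4
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)           # shape (4, 4)
```

Note that, unlike the recurrent models above, every token attends to every other token in one matrix multiplication, which is what removes the sequential dependency during training.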