Discussing the article: "Neural networks made easy (Part 77): Cross-Covariance Transformer (XCiT)"


Check out the new article: Neural networks made easy (Part 77): Cross-Covariance Transformer (XCiT).

In our models, we often use various attention algorithms, and probably most often we use Transformers. Their main disadvantage is their high resource requirements. In this article, we will consider a new algorithm that can help reduce computing costs without losing quality.

Transformers show great potential in solving problems of analyzing various sequences. The Self-Attention operation, which underlies transformers, provides global interactions between all tokens in the sequence. This makes it possible to evaluate interdependencies within the entire analyzed sequence. However, it comes with quadratic complexity in computation time and memory usage with respect to the sequence length, making it difficult to apply the algorithm to long sequences.
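The quadratic cost is easy to see in a minimal sketch of standard scaled dot-product Self-Attention (a Python/NumPy illustration, not the article's MQL5 code; variable names are illustrative): the attention matrix has one row and one column per token, so its size grows as the square of the sequence length.

```python
import numpy as np

def self_attention(q, k, v):
    """Standard scaled dot-product Self-Attention.

    q, k, v: arrays of shape (n_tokens, d).
    The scores matrix is (n_tokens, n_tokens) -- quadratic in sequence length.
    """
    d = q.shape[1]
    scores = q @ k.T / np.sqrt(d)                          # (n_tokens, n_tokens)
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = scores / scores.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ v                                     # (n_tokens, d)
```

For a sequence of 10,000 tokens the intermediate `scores` matrix alone holds 10⁸ entries, which is what makes long sequences impractical.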

To solve this problem, the authors of the paper "XCiT: Cross-Covariance Image Transformers" proposed a "transposed" version of Self-Attention that operates across feature channels rather than tokens, with interactions based on a cross-covariance matrix between keys and queries. The result is cross-covariance attention (XCA) with linear complexity in the number of tokens, allowing efficient processing of long data sequences. The Cross-Covariance Image Transformer (XCiT), built on XCA, combines the accuracy of conventional transformers with the scalability of convolutional architectures. The paper experimentally confirms the effectiveness and generality of XCiT: the presented experiments demonstrate excellent results on several visual benchmarks, including image classification, object detection, and instance segmentation.
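The "transposed" attention described above can be sketched as follows (a Python/NumPy illustration under my reading of the XCiT paper, not the article's MQL5 implementation; the temperature parameter `tau` and the softmax axis are assumptions). The key point is that the attention matrix is now d×d over feature channels, so its size does not depend on the number of tokens:

```python
import numpy as np

def xca(q, k, v, tau=1.0):
    """Cross-covariance attention (XCA) sketch.

    q, k, v: arrays of shape (n_tokens, d).
    Queries and keys are L2-normalized along the token axis, and the
    attention map softmax(k_hat.T @ q_hat / tau) is only (d, d), so the
    cost is linear in n_tokens.
    """
    qn = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=0, keepdims=True) + 1e-8)
    attn = kn.T @ qn / tau                               # (d, d) cross-covariance
    attn = np.exp(attn - attn.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)        # column-wise softmax (assumed axis)
    return v @ attn                                      # (n_tokens, d)
```

Doubling the sequence length doubles the cost of the matrix products here, instead of quadrupling it as with the token-by-token attention matrix of standard Self-Attention.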

Author: Dmitriy Gizlyk