Discussing the article: "Neural networks made easy (Part 79): Feature Aggregated Queries (FAQ) in the context of state"

 

Check out the new article: Neural networks made easy (Part 79): Feature Aggregated Queries (FAQ) in the context of state.

In the previous article, we explored one of the methods for detecting objects in an image. However, processing a static image differs from working with dynamic time series, such as the price dynamics we analyze. In this article, we will consider a method for detecting objects in video, which is somewhat closer to the problem we are solving.

Most of the methods we discussed earlier analyze the state of the environment as something static, which fully corresponds to the definition of a Markov process. Naturally, we filled the description of the environment state with historical data to provide the model with as much relevant information as possible. But the model does not evaluate the dynamics of changes between states. This also applies to the method presented in the previous article: DFFT was developed for detecting objects in static images.

However, observations of price movements show that the dynamics of change can sometimes suggest the strength and direction of an upcoming move with reasonable probability. It is therefore logical to turn our attention to methods for detecting objects in video.

Object detection in video has its own specific characteristics and must cope with changes in object features caused by motion, which do not arise in the still-image domain. One solution is to use temporal information by combining features from adjacent frames. The paper "FAQ: Feature Aggregated Queries for Transformer-based Video Object Detectors" proposes a new approach to object detection in video. Its authors improve the quality of queries in Transformer-based models by aggregating them. To achieve this, they propose a practical method for generating and aggregating queries according to the features of the input frames. Extensive experimental results presented in the paper confirm the effectiveness of the proposed method, and the approach can be extended to a wide range of image and video object detection methods to improve their efficiency.
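To make the core idea more concrete, below is a minimal Python sketch of query aggregation across adjacent frames. It is not the authors' implementation: the function name, tensor shapes, and the choice of cosine similarity with a softmax over frames are assumptions chosen purely for illustration of how per-frame object queries could be weighted by feature similarity and combined into one query set for the current frame.

```python
# Illustrative sketch only (assumed shapes and names, not the FAQ paper's code):
# per-frame object queries are combined into a single query set for the
# reference frame, with weights derived from the similarity between each
# frame's pooled features and the reference frame's features.

import torch
import torch.nn.functional as F


def aggregate_queries(frame_features: torch.Tensor,
                      frame_queries: torch.Tensor,
                      ref_index: int) -> torch.Tensor:
    """
    frame_features: [T, D]    - one pooled feature vector per frame
    frame_queries:  [T, Q, D] - Q object queries of dimension D per frame
    ref_index:      index of the current (reference) frame
    Returns: aggregated queries [Q, D] for the reference frame.
    """
    ref_feat = frame_features[ref_index]                                        # [D]
    # Similarity of every frame's features to the reference frame's features
    sims = F.cosine_similarity(frame_features, ref_feat.unsqueeze(0), dim=-1)   # [T]
    # Normalize similarities into aggregation weights that sum to 1
    weights = torch.softmax(sims, dim=0)                                        # [T]
    # Weighted sum of per-frame queries -> one query set for the reference frame
    aggregated = torch.einsum('t,tqd->qd', weights, frame_queries)              # [Q, D]
    return aggregated


# Example usage: 5 adjacent frames, 100 queries of dimension 256
feats = torch.randn(5, 256)
queries = torch.randn(5, 100, 256)
agg = aggregate_queries(feats, queries, ref_index=2)
print(agg.shape)  # torch.Size([100, 256])
```

The point of the sketch is only that the aggregation weights depend on the input frames' features rather than being fixed, so the queries fed to the Transformer decoder carry temporal information from neighbouring frames.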

Author: Dmitriy Gizlyk