Video classification is a challenging task because of the intricate spatiotemporal information present in videos. Current models often rely on 2D or 3D convolutional neural networks. However, convolutional neural networks struggle to capture long-range dependencies, and they are computationally expensive and memory-intensive. To address these challenges, a Multi-layer Transformer is proposed for video classification. The proposed method exploits the high correlation between adjacent frames by grouping them, and it learns local and global information with a multi-layer structure based on the Transformer. In the experiments, different frame sampling rates and grouping strategies are tested first, and the method is then compared with state-of-the-art models. The results show that the proposed method achieves a top-1 accuracy of 77.8% on the Kinetics-400 dataset and 64.9% on the Something-Something v2 dataset.
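The frame sampling and grouping step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the actual sampling rates or grouping strategies, so the function names `sample_frame_indices` and `group_adjacent`, the uniform stride, the non-overlapping groups, and the choice to drop a short trailing group are all assumptions made for illustration only.

```python
def sample_frame_indices(num_frames, sampling_rate):
    """Pick every `sampling_rate`-th frame index, starting at frame 0.

    Uniform striding is an assumption; the abstract does not say how
    frames are sampled.
    """
    if sampling_rate < 1:
        raise ValueError("sampling_rate must be >= 1")
    return list(range(0, num_frames, sampling_rate))


def group_adjacent(indices, group_size):
    """Split sampled indices into consecutive, non-overlapping groups.

    A trailing remainder shorter than `group_size` is dropped
    (an assumption, not something stated in the abstract).
    """
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    usable = len(indices) - len(indices) % group_size
    return [indices[i:i + group_size] for i in range(0, usable, group_size)]


# Hypothetical settings: a 64-frame clip, every 4th frame, groups of 4.
indices = sample_frame_indices(num_frames=64, sampling_rate=4)
groups = group_adjacent(indices, group_size=4)
```

Under this reading, a lower layer would attend within each group of adjacent frames (local information) and a higher layer would attend across the group summaries (global information). That layering is only one plausible interpretation of the abstract's description.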