MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition

Cited by: 54

Authors:

Chen, Jiawei [1]

Ho, Chiu Man [1]

Affiliations:

[1] OPPO US Res Ctr, Palo Alto, CA 94303 USA

Source:

2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022) | 2022

Keywords:

MOTION REPRESENTATION;

DOI:

10.1109/WACV51458.2022.00086

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Unlike other schemes that use only the decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals and the audio waveform. To handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants that factorize self-attention across the space, time and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the state-of-the-art video transformers in both efficiency and accuracy, and performs better than or on par with the state-of-the-art CNN counterparts that rely on computationally heavy optical flow.
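
As a rough illustration of why factorizing self-attention helps with a large token count, the sketch below compares the number of query-key pairs under joint attention with the number under attention applied separately along the space, time and modality axes. This is a back-of-the-envelope count, not the paper's actual model variants: the function names, the example sizes (4 modalities, 8 frames, 196 patches), and the assumption that every modality contributes the same number of tokens are all hypothetical.

```python
def joint_attention_pairs(num_modalities, num_frames, num_patches):
    """Query-key pairs when every token attends to every other token."""
    n = num_modalities * num_frames * num_patches
    return n * n


def factorized_attention_pairs(num_modalities, num_frames, num_patches):
    """Query-key pairs when each token attends along one axis at a time.

    Each token attends over the patches of its own frame (space), then
    over the frames at its own patch position (time), then over the
    modalities at its own position (modality). This is an assumed,
    simplified factorization used only for counting.
    """
    n = num_modalities * num_frames * num_patches
    return n * (num_patches + num_frames + num_modalities)


# Hypothetical sizes: 4 modalities, 8 frames, 14 x 14 = 196 patches per frame.
M, T, S = 4, 8, 196
joint = joint_attention_pairs(M, T, S)
factorized = factorized_attention_pairs(M, T, S)
print(f"joint: {joint:,} pairs")
print(f"factorized: {factorized:,} pairs")
print(f"reduction: {joint / factorized:.1f}x")
```

Under these assumed sizes the joint cost grows with the square of the total token count, while the factorized cost grows with the token count times the sum of the three axis lengths, which is the general reason such factorizations scale better.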

Pages: 786-797

Page count: 12
