MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition

Cited by: 54
Authors
Chen, Jiawei [1 ]
Ho, Chiu Man [1 ]
Affiliations
[1] OPPO US Res Ctr, Palo Alto, CA 94303 USA
Keywords
MOTION REPRESENTATION;
DOI
10.1109/WACV51458.2022.00086
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Different from other schemes which solely utilize the decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals and audio waveform. In order to handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants which factorize self-attention across the space, time and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms the state-of-the-art video transformers in both efficiency and accuracy, and performs better or equally well to the state-of-the-art CNN counterparts with computationally-heavy optical flow.
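The factorization of self-attention across the space, time, and modality dimensions mentioned in the abstract can be sketched roughly as follows. This is a simplified, single-head NumPy illustration under my own assumptions (axis order `(T, S, M, D)`, function names, and the sequential time→space→modality ordering are all illustrative, not the paper's actual implementation): instead of attending over all `T*S*M` tokens at once, each step attends along one short axis, which is what makes the variants scalable.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis
    # (the "sequence" axis); the last axis is the feature dimension.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def factorized_attention(x):
    # x: (T, S, M, D) = (time, space, modality, embedding-dim) tokens.
    # Attend along one axis at a time by moving that axis next to the
    # feature axis, so each call sees a sequence of length T, S, or M
    # rather than T*S*M.
    # 1) temporal attention (sequence axis = T)
    xt = np.moveaxis(x, 0, 2)          # (S, M, T, D)
    x = np.moveaxis(attention(xt, xt, xt), 2, 0)
    # 2) spatial attention (sequence axis = S)
    xs = np.moveaxis(x, 1, 2)          # (T, M, S, D)
    x = np.moveaxis(attention(xs, xs, xs), 2, 1)
    # 3) cross-modal attention (sequence axis = M, already in place)
    return attention(x, x, x)

x = np.random.default_rng(0).normal(size=(4, 16, 3, 8))
y = factorized_attention(x)
print(y.shape)  # (4, 16, 3, 8) — shape is preserved through all three steps
```

With 4 frames, 16 spatial tokens, and 3 modalities, joint attention would cost O((4·16·3)²) score entries per head, while the factorized version pays only O(4²) + O(16²) + O(3²) per token group — the same trade-off the model variants in the abstract exploit.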
Pages: 786 - 797
Page count: 12
Related Papers
50 items
  • [1] Multi-Modal Multi-Action Video Recognition
    Shi, Zhensheng
    Liang, Ju
    Li, Qianqian
    Zheng, Haiyong
    Gu, Zhaorui
    Dong, Junyu
    Zheng, Bing
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 13658 - 13667
  • [2] Language-guided Multi-Modal Fusion for Video Action Recognition
    Hsiao, Jenhao
    Li, Yikang
    Ho, Chiuman
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3151 - 3155
  • [3] Compressed Video Action Recognition With Dual-Stream and Dual-Modal Transformer
    Mou, Yuting
    Jiang, Xinghao
    Xu, Ke
    Sun, Tanfeng
    Wang, Zepeng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (05) : 3299 - 3312
  • [4] Multi-modal Laughter Recognition in Video Conversations
    Escalera, Sergio
    Puertas, Eloi
    Radeva, Petia
    Pujol, Oriol
    2009 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPR WORKSHOPS 2009), VOLS 1 AND 2, 2009, : 869 - 874
  • [5] On Pursuit of Designing Multi-modal Transformer for Video Grounding
    Cao, Meng
    Chen, Long
    Shou, Zheng
    Zhang, Can
    Zou, Yuexian
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 9810 - 9823
  • [6] A discriminative multi-modal adaptation neural network model for video action recognition
    Gao, Lei
    Liu, Kai
    Guan, Ling
    NEURAL NETWORKS, 2025, 185
  • [7] Multi-modal Transformer for Indoor Human Action Recognition
    Do, Jeonghyeok
    Kim, Munchurl
    2022 22ND INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION AND SYSTEMS (ICCAS 2022), 2022, : 1155 - 1160
  • [8] Multi-Modal Emotion Recognition Fusing Video and Audio
    Xu, Chao
    Du, Pufeng
    Feng, Zhiyong
    Meng, Zhaopeng
    Cao, Tianyi
    Dong, Caichao
    APPLIED MATHEMATICS & INFORMATION SCIENCES, 2013, 7 (02): : 455 - 462
  • [9] Everything at Once - Multi-modal Fusion Transformer for Video Retrieval
    Shvetsova, Nina
    Chen, Brian
    Rouditchenko, Andrew
    Thomas, Samuel
    Kingsbury, Brian
    Feris, Rogerio
    Harwath, David
    Glass, James
    Kuehne, Hilde
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19988 - 19997
  • [10] A comprehensive video dataset for multi-modal recognition systems
    Handa, A.
    Agarwal, R.
    Kohli, N.
    Data Science Journal, 2019, 18 (01):