MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition

Cited by: 54
Authors
Chen, Jiawei [1 ]
Ho, Chiu Man [1 ]
Affiliations
[1] OPPO US Res Ctr, Palo Alto, CA 94303 USA
Keywords
MOTION REPRESENTATION;
DOI
10.1109/WACV51458.2022.00086
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Unlike other schemes, which rely solely on decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals, and the audio waveform. To handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants that factorize self-attention across the space, time, and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms state-of-the-art video transformers in both efficiency and accuracy, and performs on par with or better than state-of-the-art CNN counterparts that require computationally heavy optical flow.
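The abstract's factorization idea can be sketched as follows. This is an illustrative toy, not the paper's implementation: the axis ordering, the plain single-head scaled dot-product attention, and the name `factorized_mm_attention` are all assumptions made for the example. The point is that attending separately over the space, time, and modality axes replaces one quadratic pass over all M·T·S tokens with three much smaller quadratic passes.

```python
import numpy as np

def attention(x):
    # Plain single-head scaled dot-product self-attention over the
    # second-to-last axis of x, shape (..., n, d). Illustrative only.
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)   # (..., n, n)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)                  # softmax
    return w @ x

def factorized_mm_attention(tokens):
    # tokens: (M modalities, T frames, S spatial patches, D channels).
    # Each stage moves one axis into the "token" position and attends over it,
    # so the cost is O(S^2 + T^2 + M^2) per group rather than O((M*T*S)^2).
    M, T, S, D = tokens.shape
    x = attention(tokens)                 # spatial: attend over S
    x = np.transpose(x, (0, 2, 1, 3))     # (M, S, T, D)
    x = attention(x)                      # temporal: attend over T
    x = np.transpose(x, (0, 2, 1, 3))     # back to (M, T, S, D)
    x = np.transpose(x, (1, 2, 0, 3))     # (T, S, M, D)
    x = attention(x)                      # cross-modal: attend over M
    return np.transpose(x, (2, 0, 1, 3))  # back to (M, T, S, D)

rng = np.random.default_rng(0)
out = factorized_mm_attention(rng.standard_normal((4, 8, 16, 32)))
print(out.shape)  # (4, 8, 16, 32): token layout is preserved
```

The paper's actual model variants differ in how these stages are ordered and merged, and its three cross-modal attention mechanisms are richer than the single shared attention used here.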
Pages: 786-797
Number of pages: 12
Related Papers
50 records in total
  • [31] A Multi-modal System for Video Semantic Understanding
    Lv, Zhengwei
    Lei, Tao
    Liang, Xiao
    Shi, Zhizhong
    Liu, Duoxing
    CCKS 2021 - EVALUATION TRACK, 2022, 1553 : 34 - 43
  • [32] Hierarchically multi-modal indexing of soccer video
    Liu, Yuchi
    Wu, Lingda
    Lei, Zhen
    Xie, Yuxiang
    12TH INTERNATIONAL MULTI-MEDIA MODELLING CONFERENCE PROCEEDINGS, 2006, : 393 - 396
  • [33] Multi-modal Dependency Tree for Video Captioning
    Zhao, Wentian
    Wu, Xinxiao
    Luo, Jiebo
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [34] Multi-modal tracking of faces for video communications
    Crowley, JL
    Berard, F
    1997 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, PROCEEDINGS, 1997, : 640 - 645
  • [35] Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection
    Zhuang, Xuqiang
    Liu, Fangai
    Hou, Jian
    Hao, Jianhua
    Cai, Xiaohong
    NEURAL PROCESSING LETTERS, 2022, 54 (03) : 1943 - 1960
  • [36] TMTC: trusted multi-modal transformer classification framework for video frame deletion detection
    Chunhui Feng
    Yongxiang Zhong
    Yigong Huang
    Xiaolong Liu
    The Journal of Supercomputing, 81 (7)
  • [37] Multi-modal humor segment prediction in video
    Zekun Yang
    Yuta Nakashima
    Haruo Takemura
    Multimedia Systems, 2023, 29 : 2389 - 2398
  • [38] Multi-modal Gesture Recognition using Integrated Model of Motion, Audio and Video
    Goutsu, Yusuke
    Kobayashi, Takaki
    Obara, Junya
    Kusajima, Ikuo
    Takeichi, Kazunari
    Takano, Wataru
    Nakamura, Yoshihiko
    CHINESE JOURNAL OF MECHANICAL ENGINEERING, 2015, 28 (04) : 657 - 665