MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition

Cited by: 54
Authors
Chen, Jiawei [1 ]
Ho, Chiu Man [1 ]
Affiliations
[1] OPPO US Res Ctr, Palo Alto, CA 94303 USA
Keywords
MOTION REPRESENTATION;
DOI
10.1109/WACV51458.2022.00086
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Unlike schemes that rely solely on decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals, and the audio waveform. To handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants which factorize self-attention across the space, time, and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms state-of-the-art video transformers in both efficiency and accuracy, and performs better than or on par with state-of-the-art CNN counterparts that require computationally heavy optical flow.
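The factorization idea from the abstract can be sketched in a few lines: instead of joint self-attention over all T·S·M tokens (quadratic in the full token count), attention is applied along one axis at a time, treating the other axes as batch dimensions. The sketch below is a minimal numpy illustration under assumed shapes, not the paper's implementation; the function names are hypothetical, and learned Q/K/V projections, multiple heads, residual connections, and layer normalization are all omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(tokens, axis):
    """Self-attention restricted to a single axis of a token grid.

    tokens: array of shape (T, S, M, D) -- time, space, modality, channels.
    The chosen axis is moved to the sequence position; the remaining axes
    act as batch dimensions, so each call costs O(L^2) in that axis
    length L rather than O((T*S*M)^2) for joint attention over all tokens.
    """
    x = np.moveaxis(tokens, axis, -2)                   # (..., L, D)
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)    # (..., L, L)
    out = softmax(scores) @ x                           # (..., L, D)
    return np.moveaxis(out, -2, axis)

def factorized_attention(tokens):
    """Attend over time (axis 0), space (1), then modality (2) in turn."""
    for axis in (0, 1, 2):
        tokens = axis_attention(tokens, axis)
    return tokens

# Toy token grid: 4 frames x 16 patches x 4 modalities x 8 channels.
grid = np.random.default_rng(0).normal(size=(4, 16, 4, 8))
out = factorized_attention(grid)
print(out.shape)  # (4, 16, 4, 8)
```

The shape is preserved at every step, so such factorized blocks can be stacked; the paper's variants differ in how the three axes are grouped and in how cross-modal attention is injected, which this sketch does not model.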
Pages: 786-797
Page count: 12
Related Papers
50 records in total
  • [41] The Multi-Modal Video Reasoning and Analyzing Competition
    Peng, Haoran
    Huang, He
    Xu, Li
    Li, Tianjiao
    Liu, Jun
    Rahmani, Hossein
    Ke, Qiuhong
    Guo, Zhicheng
    Wu, Cong
    Li, Rongchang
    Ye, Mang
    Wang, Jiahao
    Zhang, Jiaxu
    Liu, Yuanzhong
    He, Tao
    Zhang, Fuwei
    Liu, Xianbin
    Lin, Tao
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 806 - 813
  • [42] Multi-Modal Residual Perceptron Network for Audio-Video Emotion Recognition
    Chang, Xin
    Skarbek, Wladyslaw
    SENSORS, 2021, 21 (16)
  • [43] Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection
    Zhuang, Xuqiang
    Liu, Fangai
    Hou, Jian
    Hao, Jianhua
    Cai, Xiaohong
    NEURAL PROCESSING LETTERS, 2022, 54 : 1943 - 1960
  • [44] Multi-modal video event recognition based on association rules and decision fusion
    Guder, Mennan
    Cicekli, Nihan Kesim
    MULTIMEDIA SYSTEMS, 2018, 24 (01) : 55 - 72
  • [45] Multi-modal Gesture Recognition using Integrated Model of Motion, Audio and Video
    Goutsu, Yusuke
    Kobayashi, Takaki
    Obara, Junya
    Kusajima, Ikuo
    Takeichi, Kazunari
    Takano, Wataru
    Nakamura, Yoshihiko
    CHINESE JOURNAL OF MECHANICAL ENGINEERING, 2015, 28 (04) : 657 - 665
  • [47] A Multi-Modal Egocentric Activity Recognition Approach towards Video Domain Generalization
    Papadakis, Antonios
    Spyrou, Evaggelos
    SENSORS, 2024, 24 (08)
  • [48] Visual-guided hierarchical iterative fusion for multi-modal video action
    Zhang, Bingbing
    Zhang, Ying
    Zhang, Jianxin
    Sun, Qiule
    Wang, Rong
    Zhang, Qiang
    PATTERN RECOGNITION LETTERS, 2024, 186 : 213 - 220
  • [49] Instruction-ViT: Multi-modal prompts for instruction learning in vision transformer
    Xiao, Zhenxiang
    Chen, Yuzhong
    Yao, Junjie
    Zhang, Lu
    Liu, Zhengliang
    Wu, Zihao
    Yu, Xiaowei
    Pan, Yi
    Zhao, Lin
    Ma, Chong
    Liu, Xinyu
    Liu, Wei
    Li, Xiang
    Yuan, Yixuan
    Shen, Dinggang
    Zhu, Dajiang
    Yao, Dezhong
    Liu, Tianming
    Jiang, Xi
    INFORMATION FUSION, 2024, 104
  • [50] Multi-modal feature fusion based on multi-layers LSTM for video emotion recognition
    Nie, Weizhi
    Yan, Yan
    Song, Dan
    Wang, Kun
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (11) : 16205 - 16214