MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition

Cited by: 54
Authors
Chen, Jiawei [1 ]
Ho, Chiu Man [1 ]
Affiliations
[1] OPPO US Res Ctr, Palo Alto, CA 94303 USA
Keywords
MOTION REPRESENTATION;
DOI
10.1109/WACV51458.2022.00086
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Unlike schemes that rely solely on decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals, and the audio waveform. To handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants which factorize self-attention across the space, time, and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms state-of-the-art video transformers in both efficiency and accuracy, and performs better than or on par with state-of-the-art CNN counterparts that require computationally heavy optical flow.
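The factorization idea from the abstract can be sketched in a few lines: instead of joint self-attention over all T·S·M tokens (quadratic in the full token count), attention is applied along one axis at a time, treating the other axes as batch dimensions. The sketch below is a minimal numpy illustration under assumed shapes, not the paper's implementation; the function names are hypothetical, and learned Q/K/V projections, multiple heads, residual connections, and layer normalization are all omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(tokens, axis):
    """Self-attention restricted to a single axis of a token grid.

    tokens: array of shape (T, S, M, D) -- time, space, modality, channels.
    The chosen axis is moved to the sequence position; the remaining axes
    act as batch dimensions, so each call costs O(L^2) in that axis
    length L rather than O((T*S*M)^2) for joint attention over all tokens.
    """
    x = np.moveaxis(tokens, axis, -2)                   # (..., L, D)
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)    # (..., L, L)
    out = softmax(scores) @ x                           # (..., L, D)
    return np.moveaxis(out, -2, axis)

def factorized_attention(tokens):
    """Attend over time (axis 0), space (1), then modality (2) in turn."""
    for axis in (0, 1, 2):
        tokens = axis_attention(tokens, axis)
    return tokens

# Toy token grid: 4 frames x 16 patches x 4 modalities x 8 channels.
grid = np.random.default_rng(0).normal(size=(4, 16, 4, 8))
out = factorized_attention(grid)
print(out.shape)  # (4, 16, 4, 8)
```

The shape is preserved at every step, so such factorized blocks can be stacked; the paper's variants differ in how the three axes are grouped and in how cross-modal attention is injected, which this sketch does not model.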
Pages: 786-797
Page count: 12
Related Papers
50 records in total
  • [41] The Multi-Modal Video Reasoning and Analyzing Competition
    Peng, Haoran
    Huang, He
    Xu, Li
    Li, Tianjiao
    Liu, Jun
    Rahmani, Hossein
    Ke, Qiuhong
    Guo, Zhicheng
    Wu, Cong
    Li, Rongchang
    Ye, Mang
    Wang, Jiahao
    Zhang, Jiaxu
    Liu, Yuanzhong
    He, Tao
    Zhang, Fuwei
    Liu, Xianbin
    Lin, Tao
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 806 - 813
  • [42] Multi-Modal Residual Perceptron Network for Audio-Video Emotion Recognition
    Chang, Xin
    Skarbek, Wladyslaw
    SENSORS, 2021, 21 (16)
  • [43] Transformer-Based Interactive Multi-Modal Attention Network for Video Sentiment Detection
    Zhuang, Xuqiang
    Liu, Fangai
    Hou, Jian
    Hao, Jianhua
    Cai, Xiaohong
    NEURAL PROCESSING LETTERS, 2022, 54 : 1943 - 1960
  • [44] Multi-modal video event recognition based on association rules and decision fusion
    Guder, Mennan
    Cicekli, Nihan Kesim
    MULTIMEDIA SYSTEMS, 2018, 24 (01) : 55 - 72
  • [45] Multi-modal Gesture Recognition using Integrated Model of Motion, Audio and Video
    Goutsu, Yusuke
    Kobayashi, Takaki
    Obara, Junya
    Kusajima, Ikuo
    Takeichi, Kazunari
    Takano, Wataru
    Nakamura, Yoshihiko
    CHINESE JOURNAL OF MECHANICAL ENGINEERING, 2015, 28 (04) : 657 - 665
  • [47] A Multi-Modal Egocentric Activity Recognition Approach towards Video Domain Generalization
    Papadakis, Antonios
    Spyrou, Evaggelos
    SENSORS, 2024, 24 (08)
  • [48] Visual-guided hierarchical iterative fusion for multi-modal video action
    Zhang, Bingbing
    Zhang, Ying
    Zhang, Jianxin
    Sun, Qiule
    Wang, Rong
    Zhang, Qiang
    PATTERN RECOGNITION LETTERS, 2024, 186 : 213 - 220
  • [49] Instruction-ViT: Multi-modal prompts for instruction learning in vision transformer
    Xiao, Zhenxiang
    Chen, Yuzhong
    Yao, Junjie
    Zhang, Lu
    Liu, Zhengliang
    Wu, Zihao
    Yu, Xiaowei
    Pan, Yi
    Zhao, Lin
    Ma, Chong
    Liu, Xinyu
    Liu, Wei
    Li, Xiang
    Yuan, Yixuan
    Shen, Dinggang
    Zhu, Dajiang
    Yao, Dezhong
    Liu, Tianming
    Jiang, Xi
    INFORMATION FUSION, 2024, 104
  • [50] Multi-modal feature fusion based on multi-layers LSTM for video emotion recognition
    Nie, Weizhi
    Yan, Yan
    Song, Dan
    Wang, Kun
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (11) : 16205 - 16214