MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

Cited by: 27
Authors
Gao, Difei [1 ]
Zhou, Luowei [2 ,5 ]
Ji, Lei [3 ]
Zhu, Linchao [4 ]
Yang, Yi [4 ]
Shou, Mike Zheng [1 ]
Affiliations
[1] Natl Univ Singapore, Show Lab, Singapore, Singapore
[2] Microsoft, Albuquerque, NM USA
[3] Microsoft Res Asia, Beijing, Peoples R China
[4] Zhejiang Univ, Hangzhou, Peoples R China
[5] Google Brain, Mountain View, CA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Funding
National Research Foundation, Singapore;
DOI
10.1109/CVPR52729.2023.01419
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, using a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select the frames and image regions most relevant to the question itself. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. Experimental results on four VideoQA datasets, AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance and is superior in efficiency. The code is available at github.com/showlab/mist.
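The abstract describes MIST's core mechanism: question-conditioned segment selection, region selection within the chosen segments, and attention over the resulting multi-granularity tokens, repeated across layers. The following is a minimal PyTorch sketch of one such iterative layer. All shapes, module names (IterativeSelectionLayer, seg_scorer, reg_scorer), and the hard top-k selection are illustrative assumptions made for this sketch, not the authors' implementation, which is available at github.com/showlab/mist.

# Minimal sketch of one MIST-style iterative layer: question-conditioned
# segment selection, region selection within the chosen segments, then
# joint attention over question, segment, and selected-region tokens.
# Shapes, module names, and the hard top-k selection are illustrative
# only; see github.com/showlab/mist for the authors' implementation.
import torch
import torch.nn as nn

class IterativeSelectionLayer(nn.Module):
    def __init__(self, dim=512, k_segments=2, k_regions=4, n_heads=8):
        super().__init__()
        self.k_segments = k_segments
        self.k_regions = k_regions
        # Linear scorers that rate segments/regions against the question.
        self.seg_scorer = nn.Linear(dim, 1)
        self.reg_scorer = nn.Linear(dim, 1)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, q, seg_feats, reg_feats):
        # q:         (B, D)        pooled question embedding
        # seg_feats: (B, S, D)     one feature per video segment
        # reg_feats: (B, S, R, D)  region features per segment
        B, S, R, D = reg_feats.shape

        # 1) Segment selection: score each segment against the question
        #    and keep the top-k most relevant ones (hard top-k here; a
        #    real model would need a differentiable selection).
        seg_scores = self.seg_scorer(seg_feats + q[:, None]).squeeze(-1)  # (B, S)
        top_seg = seg_scores.topk(self.k_segments, dim=1).indices         # (B, k_s)
        idx = top_seg[..., None, None].expand(-1, -1, R, D)
        sel_regs = reg_feats.gather(1, idx)                               # (B, k_s, R, D)

        # 2) Region selection inside the chosen segments.
        sel_regs = sel_regs.reshape(B, -1, D)                             # (B, k_s*R, D)
        reg_scores = self.reg_scorer(sel_regs + q[:, None]).squeeze(-1)
        top_reg = reg_scores.topk(self.k_regions, dim=1).indices
        sel_regs = sel_regs.gather(1, top_reg[..., None].expand(-1, -1, D))

        # 3) Attention mixing granularities; the refined question token
        #    conditions the next layer's selection.
        tokens = torch.cat([q[:, None], seg_feats, sel_regs], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)
        return out[:, 0], seg_feats + out[:, 1:1 + S]  # new q, updated segments

# Usage: stacking layers lets each round re-select segments and regions
# conditioned on the refined question token, which is how the abstract's
# multi-event reasoning would be realized in this sketch.
layers = nn.ModuleList(IterativeSelectionLayer() for _ in range(2))
q = torch.randn(2, 512)
segs, regs = torch.randn(2, 8, 512), torch.randn(2, 8, 16, 512)
for layer in layers:
    q, segs = layer(q, segs, regs)
print(q.shape)  # torch.Size([2, 512])

Note that attending over a handful of selected regions rather than every patch in every frame is what makes this cheaper than dense spatial-temporal self-attention, which is the efficiency argument the abstract makes for long-form video.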
Pages: 14773 - 14783
Number of pages: 11
Related Papers
38 items in total
  • [31] Co-Attending Free-Form Regions and Detections with Multi-Modal Multiplicative Feature Embedding for Visual Question Answering
    Lu, Pan
    Li, Hongsheng
    Zhang, Wei
    Wang, Jianyong
    Wang, Xiaogang
    THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 7218 - 7225
  • [32] CFMMC-Align: Coarse-Fine Multi-Modal Contrastive Alignment Network for Traffic Event Video Question Answering
    Guo, Kan
    Tian, Daxin
    Hu, Yongli
    Lin, Chunmian
    Sun, Yanfeng
    Zhou, Jianshan
    Duan, Xuting
    Gao, Junbin
    Yin, Baocai
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 10538 - 10550
  • [33] Attention-guided video super-resolution with recurrent multi-scale spatial-temporal transformer
    Sun, Wei
    Kong, Xianguang
    Zhang, Yanning
    COMPLEX & INTELLIGENT SYSTEMS, 2023, 9 (04) : 3989 - 4002
  • [34] Lie Recognition with Multi-Modal Spatial-Temporal State Transition Patterns Based on Hybrid Convolutional Neural Network-Bidirectional Long Short-Term Memory
    Abdullahi, Sunusi Bala
    Bature, Zakariyya Abdullahi
    Gabralla, Lubna A.
    Chiroma, Haruna
    BRAIN SCIENCES, 2023, 13 (04)
  • [35] Multi-modal hybrid modeling strategy based on Gaussian Mixture Variational Autoencoder and spatial-temporal attention: Application to industrial process prediction
    Peng, Haifei
    Long, Jian
    Huang, Cheng
    Wei, Shibo
    Ye, Zhencheng
    CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2024, 244
  • [36] Multi-Granularity Contrastive Cross-Modal Collaborative Generation for End-to-End Long-Term Video Question Answering
    Yu, Ting
    Fu, Kunhao
    Zhang, Jian
    Huang, Qingming
    Yu, Jun
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 3115 - 3129
  • [37] TFF-temporal fusion framework for advancing video retrieval through long-range dependencies and multi-modal intent
    Singh, Pratibha
    Chakrawal, Kashvi
    Kushwaha, Alok Kumar Singh
    MACHINE VISION AND APPLICATIONS, 2025, 36 (3)
  • [38] Space-time super-resolution for satellite video: A joint framework based on multi-scale spatial-temporal transformer
    Xiao, Yi
    Yuan, Qiangqiang
    He, Jiang
    Zhang, Qiang
    Sun, Jing
    Su, Xin
    Wu, Jialian
    Zhang, Liangpei
    INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2022, 108