MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

Cited by: 27
Authors
Gao, Difei [1 ]
Zhou, Luowei [2 ,5 ]
Ji, Lei [3 ]
Zhu, Linchao [4 ]
Yang, Yi [4 ]
Shou, Mike Zheng [1 ]
Affiliations
[1] Natl Univ Singapore, Show Lab, Singapore, Singapore
[2] Microsoft, Albuquerque, NM USA
[3] Microsoft Res Asia, Beijing, Peoples R China
[4] Zhejiang Univ, Hangzhou, Peoples R China
[5] Google Brain, Mountain View, CA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023
Funding
National Research Foundation, Singapore
DOI
10.1109/CVPR52729.2023.01419
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-Temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select the frames and image regions most relevant to the question. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. Experimental results on four VideoQA datasets, AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance with superior computational efficiency. The code is available at github.com/showlab/mist.
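The cascaded "select segments, then select regions, then attend" idea described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' released implementation (see github.com/showlab/mist); the module structure, the hard (non-differentiable) top-k selection, the feature dimensions, and the hyper-parameters top_segments and top_regions are assumptions made purely for illustration.

```python
# Minimal sketch (assumptions, not the authors' code) of iterative
# question-conditioned segment/region selection followed by attention.
import torch
import torch.nn as nn


class SelectionLayer(nn.Module):
    """One layer of cascaded segment and region selection plus self-attention."""

    def __init__(self, dim: int, num_heads: int = 8,
                 top_segments: int = 4, top_regions: int = 12):
        super().__init__()
        self.seg_scorer = nn.Linear(dim, 1)   # scores segments against the question
        self.reg_scorer = nn.Linear(dim, 1)   # scores regions against the question
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.top_segments = top_segments
        self.top_regions = top_regions

    def forward(self, regions: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # regions: (B, T, R, D) region features, T segments with R regions each
        # question: (B, D) pooled question feature
        B, T, R, D = regions.shape
        q = question[:, None, None, :]                                       # (B, 1, 1, D)

        # Segment selection: keep the segments most relevant to the question.
        seg_feat = regions.mean(dim=2)                                       # (B, T, D)
        seg_scores = self.seg_scorer(seg_feat * question[:, None, :]).squeeze(-1)
        top_t = seg_scores.topk(self.top_segments, dim=1).indices            # (B, k_t)
        sel = torch.gather(regions, 1,
                           top_t[:, :, None, None].expand(-1, -1, R, D))     # (B, k_t, R, D)

        # Region selection within the kept segments (hard top-k for simplicity).
        reg_scores = self.reg_scorer(sel * q).squeeze(-1)                    # (B, k_t, R)
        top_r = reg_scores.topk(self.top_regions, dim=2).indices             # (B, k_t, k_r)
        sel = torch.gather(sel, 2,
                           top_r[:, :, :, None].expand(-1, -1, -1, D))       # (B, k_t, k_r, D)

        # Attention over the much smaller selected token set plus the question token.
        tokens = torch.cat([question[:, None, :], sel.flatten(1, 2)], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)
        return out[:, 0]                                                     # updated question feature


if __name__ == "__main__":
    # Toy usage: iterate selection + attention across layers, letting the
    # updated question feature re-drive selection (multi-event reasoning).
    layers = nn.ModuleList([SelectionLayer(dim=512) for _ in range(2)])
    regions = torch.randn(2, 16, 20, 512)   # 2 videos, 16 segments, 20 regions each
    question = torch.randn(2, 512)
    for layer in layers:
        question = layer(regions, question)
    print(question.shape)                    # torch.Size([2, 512])
```

In this sketch each layer attends over only k_t * k_r selected tokens rather than all T * R region tokens, which is the efficiency argument the abstract makes; the iteration loop lets later layers re-select different segments conditioned on the updated question representation.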
Pages: 14773-14783
Number of pages: 11