MIST : Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

Cited by: 27
Authors
Gao, Difei [1 ]
Zhou, Luowei [2 ,5 ]
Ji, Lei [3 ]
Zhu, Linchao [4 ]
Yang, Yi [4 ]
Shou, Mike Zheng [1 ]
Affiliations
[1] Natl Univ Singapore, Show Lab, Singapore, Singapore
[2] Microsoft, Albuquerque, NM USA
[3] Microsoft Res Asia, Beijing, Peoples R China
[4] Zhejiang Univ, Hangzhou, Peoples R China
[5] Google Brain, Mountain View, CA USA
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
Funding
National Research Foundation, Singapore;
DOI
10.1109/CVPR52729.2023.01419
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, using a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select frames and image regions closely relevant to the question itself. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. Experimental results on four VideoQA datasets, including AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance and is superior in efficiency. The code is available at github.com/showlab/mist.
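The cascaded selection that the abstract describes can be illustrated with a minimal sketch in plain Python. This is a hypothetical simplification, not the paper's implementation: the real MIST uses learned transformer layers with differentiable selection and multi-head attention, whereas the sketch below uses hard top-k selection over raw dot-product relevance scores. All function names (`iterative_select`, `top_k`) and the toy feature vectors are illustrative assumptions.

```python
# Sketch of MIST-style iterative cascaded selection (hypothetical
# simplification: hard top-k in place of learned, differentiable selection).
from math import exp

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def top_k(items, scores, k):
    # Keep the k items with the highest relevance scores.
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in order[:k]]

def iterative_select(segments, question, layers=2, k_seg=2, k_reg=2):
    """segments: list of video segments, each a list of region feature
    vectors. Each layer narrows attention in two cascaded steps:
    question-conditioned segment selection, then region selection
    inside the kept segments."""
    selected = segments
    for _ in range(layers):
        # Segment selection: score each segment by its mean-pooled feature.
        seg_feats = [[sum(col) / len(seg) for col in zip(*seg)]
                     for seg in selected]
        seg_scores = [dot(f, question) for f in seg_feats]
        selected = top_k(selected, seg_scores, k_seg)
        # Region selection: keep only question-relevant regions per segment.
        selected = [top_k(seg, [dot(r, question) for r in seg], k_reg)
                    for seg in selected]
    # Attention pooling over surviving regions yields the answer feature.
    regions = [r for seg in selected for r in seg]
    weights = softmax([dot(r, question) for r in regions])
    return [sum(w * r[d] for w, r in zip(weights, regions))
            for d in range(len(question))]
```

Because each layer re-scores only the segments and regions that survived the previous layer, the cost stays far below dense spatial-temporal self-attention over every frame, which is the efficiency argument the abstract makes.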
Pages: 14773-14783
Page count: 11