Discovering Spatio-Temporal Rationales for Video Question Answering

Authors
Li, Yicong [1]
Xiao, Junbin [1]
Feng, Chun [2]
Wang, Xiang [2]
Chua, Tat-Seng [1]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Univ Sci & Technol China, Hefei, Peoples R China
DOI
10.1109/ICCV51070.2023.01275
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper addresses complex video question answering (VideoQA), which features long videos containing multiple objects and events at different times. To tackle this challenge, we highlight the importance of identifying question-critical temporal moments and spatial objects from the vast amount of video content. To this end, we propose Spatio-Temporal Rationalization (STR), a differentiable selection module that adaptively collects question-critical moments and objects using cross-modal interaction. The discovered video moments and objects then serve as grounded rationales to support answer reasoning. Based on STR, we further propose TranSTR, a Transformer-style neural network architecture that takes STR as its core and additionally incorporates a novel answer interaction mechanism to coordinate STR for answer decoding. Experiments on four datasets show that TranSTR achieves new state-of-the-art (SoTA) results. In particular, on NExT-QA and Causal-VidQA, which feature complex VideoQA, it significantly surpasses the previous SoTA by 5.8% and 6.8%, respectively. We then conduct extensive studies to verify the importance of STR as well as the proposed answer interaction mechanism. With the success of TranSTR and our comprehensive analysis, we hope this work can spark more future efforts in complex VideoQA. Code will be released at https://github.com/yl3800/TranSTR.
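The selection idea in the abstract (score video items against the question, then keep the most question-critical ones) can be illustrated with a toy sketch. This is not the authors' STR module: the function names `softmax` and `select_top_k`, the dot-product scoring, and the feature vectors are all hypothetical, and the hard top-k used here is a simple non-differentiable stand-in for the differentiable selection the abstract describes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k(question_vec, item_feats, k):
    """Score each item (frame or object) against the question and keep k.

    Assumption: relevance is a plain dot product between the question
    vector and each item feature; a real model would use learned
    cross-modal attention instead.
    """
    scores = [sum(q * f for q, f in zip(question_vec, feat)) for feat in item_feats]
    weights = softmax(scores)
    ranked = sorted(range(len(item_feats)), key=lambda i: weights[i], reverse=True)
    # Return kept indices in their original (temporal) order.
    return sorted(ranked[:k]), weights


# Hypothetical 2-D features: one question vector and four frame vectors.
question = [1.0, 0.0]
frames = [[0.1, 0.9], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
kept, weights = select_top_k(question, frames, k=2)
print(kept)
```

In this toy setup the two frames whose features align most with the question vector are kept, and only those would be passed on to answer reasoning.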
Pages: 13823-13832
Number of pages: 10