Discovering Spatio-Temporal Rationales for Video Question Answering

Authors
Li, Yicong [1]
Xiao, Junbin [1]
Feng, Chun [2]
Wang, Xiang [2]
Chua, Tat-Seng [1]
Affiliations
[1] National University of Singapore, Singapore, Singapore
[2] University of Science and Technology of China, Hefei, People's Republic of China
DOI
10.1109/ICCV51070.2023.01275
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper tackles complex video question answering (VideoQA), which features long videos containing multiple objects and events at different times. To address this challenge, we highlight the importance of identifying question-critical temporal moments and spatial objects from the vast amount of video content. To this end, we propose Spatio-Temporal Rationalization (STR), a differentiable selection module that adaptively collects question-critical moments and objects using cross-modal interaction. The discovered video moments and objects then serve as grounded rationales to support answer reasoning. Building on STR, we further propose TranSTR, a Transformer-style neural network architecture that takes STR as its core and additionally introduces a novel answer interaction mechanism to coordinate STR for answer decoding. Experiments on four datasets show that TranSTR achieves new state-of-the-art (SoTA) results. In particular, on NExT-QA and Causal-VidQA, which feature complex VideoQA, it significantly surpasses the previous SoTA by 5.8% and 6.8%, respectively. We then conduct extensive studies to verify the importance of STR as well as the proposed answer interaction mechanism. With the success of TranSTR and our comprehensive analysis, we hope this work can spark more future efforts in complex VideoQA. Code will be released at https://github.com/yl3800/TranSTR.
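The question-conditioned selection idea described in the abstract can be illustrated with a small sketch. This is not the authors' STR module: the function name `select_rationale`, the scaled dot-product scoring, and the toy vectors are all assumptions made for illustration. The sketch also uses a hard top-k pick, which is not differentiable; the abstract describes STR as differentiable, and how that is achieved is not shown here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_rationale(question_vec, item_vecs, k):
    """Hypothetical helper: score each item (frame or object feature)
    against the question and keep the k highest-scoring items.

    Returns the selected indices in their original order, plus the
    softmax relevance weights for all items.
    """
    scale = math.sqrt(len(question_vec))
    scores = [dot(question_vec, v) / scale for v in item_vecs]
    weights = softmax(scores)
    ranked = sorted(range(len(item_vecs)), key=lambda i: weights[i], reverse=True)
    # Hard top-k selection (illustrative only, not differentiable).
    return sorted(ranked[:k]), weights


# Toy example: one question embedding and four frame embeddings.
question = [1.0, 0.0]
frames = [[0.1, 0.9], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
indices, weights = select_rationale(question, frames, k=2)
print(indices)
```

The same helper could be applied a second time to object features inside the selected frames, which mirrors the moment-then-object ordering mentioned in the abstract only loosely.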
Pages: 13823-13832
Number of pages: 10