Discovering Spatio-Temporal Rationales for Video Question Answering

Cited by: 0
Authors
Li, Yicong [1]
Xiao, Junbin [1]
Feng, Chun [2]
Wang, Xiang [2]
Chua, Tat-Seng [1]
Affiliations
[1] National University of Singapore, Singapore
[2] University of Science and Technology of China, Hefei, China
DOI
10.1109/ICCV51070.2023.01275
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper addresses complex video question answering (VideoQA), which features long videos containing multiple objects and events at different times. To tackle this challenge, we highlight the importance of identifying question-critical temporal moments and spatial objects from the vast amount of video content. To this end, we propose Spatio-Temporal Rationalization (STR), a differentiable selection module that adaptively collects question-critical moments and objects using cross-modal interaction. The discovered video moments and objects then serve as grounded rationales to support answer reasoning. Based on STR, we further propose TranSTR, a Transformer-style neural network architecture that takes STR as its core and additionally introduces a novel answer interaction mechanism to coordinate STR for answer decoding. Experiments on four datasets show that TranSTR achieves a new state of the art (SoTA). In particular, on NExT-QA and Causal-VidQA, which feature complex VideoQA, it surpasses the previous SoTA by 5.8% and 6.8%, respectively. We then conduct extensive studies to verify the importance of STR as well as the proposed answer interaction mechanism. With the success of TranSTR and our comprehensive analysis, we hope this work can spark more future efforts in complex VideoQA. Code will be released at https://github.com/yl3800/TranSTR.
Pages: 13823-13832
Page count: 10