Dynamic Spatio-Temporal Modular Network for Video Question Answering

Cited by: 3
Authors
Qian, Zi [1]
Wang, Xin [1]
Duan, Xuguang [1]
Chen, Hong [1]
Zhu, Wenwu [1]
Affiliation
[1] Tsinghua University, Beijing, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
video question answering; modular neural network
DOI
10.1145/3503161.3548061
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Video Question Answering (VideoQA) aims to understand given videos and questions comprehensively by generating correct answers. However, existing methods usually rely on end-to-end black-box deep neural networks to infer the answers, which differs significantly from human logical reasoning and therefore lacks explainability. In addition, the performance of existing methods tends to drop when answering compositional questions involving realistic scenarios. To tackle these challenges, we propose a Dynamic Spatio-Temporal Modular Network (DSTN) model, which uses a spatio-temporal modular network to simulate the compositional reasoning procedure of human beings. Concretely, we divide the task of answering a given question into a set of sub-tasks, each focusing on certain key concepts in questions and videos, such as objects, actions, and temporal orders. Each sub-task can be solved with a separately designed module, e.g., a spatial attention module, a temporal attention module, a logic module, or an answer module. We then dynamically assemble the modules assigned to the different sub-tasks into a tree-structured spatio-temporal modular neural network that performs human-like reasoning before producing the final answer to the question. We carry out extensive experiments on the AGQA dataset to demonstrate that the proposed DSTN model significantly outperforms several baseline methods in various settings. Moreover, we evaluate intermediate results and visualize each reasoning step to verify the rationality of the different modules and the explainability of the proposed DSTN model.
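The idea of assembling small modules into a question-specific tree can be illustrated with a toy sketch. This is not the DSTN implementation: the module names (`attend`, `before`, `and`, `exists`), the `Node` class, the hand-written program tree, and the set-based "video" are all hypothetical stand-ins chosen for illustration, and real modules would be learned neural networks operating on visual features rather than hard-coded functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Toy "video" (assumption): each frame is the set of concepts visible in it.
Video = List[set]


@dataclass
class Node:
    """One node of a hypothetical program tree: a module name, an argument, children."""
    op: str
    arg: str = ""
    children: List["Node"] = field(default_factory=list)


def attend(video, arg, inputs):
    # Stand-in for temporal attention: weight 1.0 on frames containing `arg`.
    return [1.0 if arg in frame else 0.0 for frame in video]


def before(video, arg, inputs):
    # Temporal-order stand-in: keep frames strictly before the first attended frame.
    (weights,) = inputs
    first = next((i for i, w in enumerate(weights) if w > 0), len(weights))
    return [1.0 if i < first else 0.0 for i in range(len(weights))]


def logic_and(video, arg, inputs):
    # Logic stand-in: element-wise minimum of two attention maps.
    left, right = inputs
    return [min(x, y) for x, y in zip(left, right)]


def exists(video, arg, inputs):
    # Answer stand-in: "yes" if any frame is still attended.
    (weights,) = inputs
    return "yes" if max(weights, default=0.0) > 0 else "no"


MODULES: Dict[str, Callable] = {
    "attend": attend, "before": before, "and": logic_and, "exists": exists,
}


def execute(node: Node, video: Video, trace: list):
    """Run the tree bottom-up, recording every intermediate result for inspection."""
    inputs = [execute(child, video, trace) for child in node.children]
    output = MODULES[node.op](video, node.arg, inputs)
    trace.append((node.op, node.arg, output))
    return output


# Hand-built tree for: "Was a cup visible before the door appeared?"
program = Node("exists", children=[
    Node("and", children=[
        Node("attend", "cup"),
        Node("before", children=[Node("attend", "door")]),
    ]),
])

video = [{"person", "cup"}, {"person", "cup"}, {"person", "door"}, {"person"}]
trace: list = []
answer = execute(program, video, trace)
```

Because every node's output is stored in `trace`, each intermediate attention map can be inspected on its own, which is the kind of step-by-step explainability the abstract describes. In this sketch the tree is written by hand; how a tree would be derived from the question text is outside its scope.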
Pages: 4466-4477
Number of pages: 12