Dynamic Spatio-Temporal Modular Network for Video Question Answering

被引:3
|
作者
Qian, Zi [1 ]
Wang, Xin [1 ]
Duan, Xuguang [1 ]
Chen, Hong [1 ]
Zhu, Wenwu [1 ]
机构
[1] Tsinghua Univ, Beijing, Peoples R China
基金
中国国家自然科学基金;
关键词
video question answering; modular neural network;
D O I
10.1145/3503161.3548061
中图分类号
TP39 [计算机的应用];
学科分类号
081203 ; 0835 ;
摘要
Video Question Answering (VideoQA) aims to understand given videos and questions comprehensively by generating correct answers. However, existing methods usually rely on end-to-end blackbox deep neural networks to infer the answers, which significantly differs from human logic reasoning, thus lacking the ability to explain. Besides, the performances of existing methods tend to drop when answering compositional questions involving realistic scenarios. To tackle these challenges, we propose a Dynamic Spatio-Temporal Modular Network (DSTN) model, which utilizes a spatio-temporal modular network to simulate the compositional reasoning procedure of human beings. Concretely, we divide the task of answering a given question into a set of sub-tasks focusing on certain key concepts in questions and videos such as objects, actions, temporal orders, etc. Each sub-task can be solved with a separately designed module, e.g., spatial attention module, temporal attention module, logic module, and answer module. Then we dynamically assemble different modules assigned with different sub-tasks to generate a tree-structured spatio-temporal modular neural network for human-like reasoning before producing the final answer for the question. We carry out extensive experiments on the AGQA dataset to demonstrate our proposed DSTN model can significantly outperform several baseline methods in various settings. Moreover, we evaluate intermediate results and visualize each reasoning step to verify the rationality of different modules and the explainability of the proposed DSTN model.
引用
收藏
页码:4466 / 4477
页数:12
相关论文
共 50 条
  • [21] Spatio-Temporal Convolution-Attention Video Network
    Diba, Ali
    Sharma, Vivek
    Arzani, Mohammad. M.
    Van Gool, Luc
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 859 - 869
  • [22] Spatio-Temporal Deformable Attention Network for Video Deblurring
    Zhang, Huicong
    Xie, Haozhe
    Yao, Hongxun
    [J]. arXiv, 2022,
  • [23] Spatio-Temporal Inference Transformer Network for Video Inpainting
    Tudavekar, Gajanan
    Saraf, Santosh S.
    Patil, Sanjay R.
    [J]. INTERNATIONAL JOURNAL OF IMAGE AND GRAPHICS, 2023, 23 (01)
  • [24] Spatio-temporal Sampling for Video
    Shankar, Mohan
    Pitsiauis, Nikos P.
    Brady, David
    [J]. IMAGE RECONSTRUCTION FROM INCOMPLETE DATA V, 2008, 7076
  • [25] DEFINING DYNAMIC SPATIO-TEMPORAL NEIGHBOURHOOD OF NETWORK DATA
    Cheng, Tao
    Anbaroglu, Berk
    [J]. JOINT INTERNATIONAL CONFERENCE ON THEORY, DATA HANDLING AND MODELLING IN GEOSPATIAL INFORMATION SCIENCE, 2010, 38 : 75 - 79
  • [26] Spatio-temporal prediction and reconstruction network for video anomaly detection
    Liu, Ting
    Zhang, Chengqing
    Niu, Xiaodong
    Wang, Liming
    [J]. PLOS ONE, 2022, 17 (05):
  • [27] A spatio-temporal network for video semantic segmentation in surgical videos
    Maria Grammatikopoulou
    Ricardo Sanchez-Matilla
    Felix Bragman
    David Owen
    Lucy Culshaw
    Karen Kerr
    Danail Stoyanov
    Imanol Luengo
    [J]. International Journal of Computer Assisted Radiology and Surgery, 2024, 19 : 375 - 382
  • [28] A spatio-temporal network for video semantic segmentation in surgical videos
    Grammatikopoulou, Maria
    Sanchez-Matilla, Ricardo
    Bragman, Felix
    Owen, David
    Culshaw, Lucy
    Kerr, Karen
    Stoyanov, Danail
    Luengo, Imanol
    [J]. INTERNATIONAL JOURNAL OF COMPUTER ASSISTED RADIOLOGY AND SURGERY, 2023, 19 (2) : 375 - 382
  • [29] SPATIO-TEMPORAL MOTION AGGREGATION NETWORK FOR VIDEO ACTION DETECTION
    Zhang, Hongcheng
    Zhao, Xu
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 2180 - 2184
  • [30] Video object segmentation using spatio-temporal deep network
    Ramaswamy, Akshaya
    Gubbi, Jayavardhana
    Balamuralidhar, P.
    [J]. 2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,