Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering

Times cited: 0
Authors
Mao, Jianguo [1, 2]
Jiang, Wenbin [3 ]
Wang, Xiangdong [1 ]
Feng, Zhifan [3 ]
Lyu, Yajuan [3 ]
Liu, Hong [1 ]
Zhu, Yong [3 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing Key Lab Mobile Comp & Pervas Device, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Baidu Inc, Beijing, Peoples R China
Funding
Beijing Natural Science Foundation
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Existing video question answering (video QA) models lack the capacity for deep video understanding and flexible multistep reasoning. We propose a novel video QA model that performs dynamic multistep reasoning between questions and videos. It builds a semantic representation of the video from a video scene graph composed of the semantic elements of the video and the semantic relations among them. It then performs multistep reasoning between the representations of the question and the video to reach a better answer decision, and dynamically integrates the reasoning results. Experiments show that the proposed model has a significant advantage over previous methods in accuracy and interpretability. Compared with the existing state-of-the-art model, the proposed model improves by more than 4%/3.1%/2% on three widely used video QA datasets, MSRVTT-QA, MSRVTT multi-choice, and TGIF-QA, respectively, and offers better interpretability by backtracing along the attention mechanisms to the video scene graphs.
Pages: 3894-3904
Number of pages: 11