VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering

Cited by: 1
Authors
Wang, Yanan [1, 2]
Yasunaga, Michihiro [2]
Ren, Hongyu [2]
Wada, Shinya [1]
Leskovec, Jure [2]
Affiliations
[1] KDDI Res, Fujimino, Japan
[2] Stanford Univ, Stanford, CA 94305 USA
DOI
10.1109/ICCV51070.2023.01973
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Visual question answering (VQA) requires systems to perform concept-level reasoning by unifying unstructured (e.g., the context in question and answer; "QA context") and structured (e.g., knowledge graph for the QA context and scene; "concept graph") multimodal knowledge. Existing works typically combine a scene graph and a concept graph of the scene by connecting corresponding visual nodes and concept nodes, then incorporate the QA context representation to perform question answering. However, these methods only perform a unidirectional fusion from unstructured knowledge to structured knowledge, limiting their potential to capture joint reasoning over the heterogeneous modalities of knowledge. To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations. Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context, and introduce a new multimodal GNN technique to perform inter-modal message passing for reasoning that mitigates representational gaps between modalities. On two challenging VQA tasks (VCR and GQA), our method outperforms strong baseline VQA methods by 3.2% on VCR (Q-AR) and 4.6% on GQA, suggesting its strength in performing concept-level reasoning. Ablation studies further demonstrate the efficacy of the bidirectional fusion and multimodal GNN method in unifying unstructured and structured multimodal knowledge.
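The super-node idea in the abstract can be illustrated with a toy sketch. This is not the authors' model: the node names, the 2-dimensional features, and the plain mean aggregation are all assumptions made for illustration, standing in for the paper's multimodal GNN. The sketch only shows how a single QA-context node connected to both graphs lets information flow in both directions between scene nodes and concept nodes.

```python
from collections import defaultdict


def build_joint_graph(scene_nodes, scene_edges, concept_nodes, concept_edges,
                      super_node="qa_context"):
    """Link a scene graph and a concept graph through one super node.

    Returns an undirected adjacency map. All names here are hypothetical.
    """
    adj = defaultdict(set)
    for u, v in list(scene_edges) + list(concept_edges):
        adj[u].add(v)
        adj[v].add(u)
    # The super node is connected to every node of both graphs.
    for n in list(scene_nodes) + list(concept_nodes):
        adj[super_node].add(n)
        adj[n].add(super_node)
    return adj


def message_passing_step(adj, features):
    """One round of mean aggregation (an assumed, simplified update rule).

    Each node's new vector is the mean of its own vector and its neighbours'.
    """
    updated = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in adj[node]]
        updated[node] = [sum(g[i] for g in group) / len(group)
                         for i in range(len(vec))]
    return updated


# Toy data: dimension 0 marks "visual" signal, dimension 1 marks "concept" signal.
scene_nodes = ["person", "cup"]
concept_nodes = ["drink", "thirsty"]
adj = build_joint_graph(scene_nodes, [("person", "cup")],
                        concept_nodes, [("drink", "thirsty")])
h0 = {
    "person": [1, 0], "cup": [1, 0],
    "drink": [0, 1], "thirsty": [0, 1],
    "qa_context": [0, 0],
}
h1 = message_passing_step(adj, h0)  # super node gathers from both graphs
h2 = message_passing_step(adj, h1)  # both graphs receive the fused signal back
```

In this toy setup the scene node `person` carries no concept signal after the first round, and picks it up only in the second round via `qa_context`, which is the bidirectional path the abstract describes in general terms.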
Pages: 21525-21535
Number of pages: 11