VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering

Cited by: 1
Authors
Wang, Yanan [1 ,2 ]
Yasunaga, Michihiro [2 ]
Ren, Hongyu [2 ]
Wada, Shinya [1 ]
Leskovec, Jure [2 ]
Affiliations
[1] KDDI Res, Fujimino, Japan
[2] Stanford Univ, Stanford, CA 94305 USA
DOI
10.1109/ICCV51070.2023.01973
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Visual question answering (VQA) requires systems to perform concept-level reasoning by unifying unstructured (e.g., the context in question and answer; "QA context") and structured (e.g., knowledge graph for the QA context and scene; "concept graph") multimodal knowledge. Existing works typically combine a scene graph and a concept graph of the scene by connecting corresponding visual nodes and concept nodes, then incorporate the QA context representation to perform question answering. However, these methods only perform a unidirectional fusion from unstructured knowledge to structured knowledge, limiting their potential to capture joint reasoning over the heterogeneous modalities of knowledge. To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations. Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context, and introduce a new multimodal GNN technique to perform inter-modal message passing for reasoning that mitigates representational gaps between modalities. On two challenging VQA tasks (VCR and GQA), our method outperforms strong baseline VQA methods by 3.2% on VCR (Q-AR) and 4.6% on GQA, suggesting its strength in performing concept-level reasoning. Ablation studies further demonstrate the efficacy of the bidirectional fusion and multimodal GNN method in unifying unstructured and structured multimodal knowledge.
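The core idea in the abstract — a QA-context "super node" that bridges the scene graph and the concept graph so that messages flow in both directions — can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, weight matrices, and update rule below are illustrative assumptions, meant only to show how a shared super node enables bidirectional fusion between the two graphs.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden dimension (illustrative)

# Node features: scene-graph nodes, concept-graph nodes, and one
# QA-context super node that inter-connects the two graphs.
scene = rng.normal(size=(3, D))     # e.g., detected visual objects
concept = rng.normal(size=(4, D))   # e.g., retrieved KG concepts
super_node = rng.normal(size=(D,))  # pooled QA-context embedding

# Hypothetical weight matrices for messages into and out of the super node.
W_up = rng.normal(size=(D, D)) / np.sqrt(D)
W_down = rng.normal(size=(D, D)) / np.sqrt(D)

def relu(x):
    return np.maximum(x, 0.0)

for _ in range(2):  # two rounds of bidirectional message passing
    # structured -> unstructured: both graphs send messages to the QA context
    incoming = np.concatenate([scene, concept], axis=0).mean(axis=0)
    super_node = relu(incoming @ W_up + super_node)
    # unstructured -> structured: the QA context broadcasts back to both graphs
    scene = relu(scene + super_node @ W_down)
    concept = relu(concept + super_node @ W_down)

print(super_node.shape, scene.shape, concept.shape)
```

Because every update round routes through the shared super node, information from the scene graph can reach the concept graph (and vice versa) in two hops, which is the fusion pattern the abstract contrasts with unidirectional approaches.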
Pages: 21525-21535
Page count: 11
Related Papers
50 records in total
  • [21] OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
    Marino, Kenneth
    Rastegari, Mohammad
    Farhadi, Ali
    Mottaghi, Roozbeh
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 3190 - 3199
  • [22] Semantic Relation Graph Reasoning Network for Visual Question Answering
    Lan, Hong
    Zhang, Pufen
    TWELFTH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING SYSTEMS, 2021, 11719
  • [23] Explainable Knowledge Reasoning via Thought Chains for Knowledge-Based Visual Question Answering
    Qiu, Chen
    Xie, Zhiqiang
    Liu, Maofu
    Hu, Huijun
    INFORMATION PROCESSING & MANAGEMENT, 2024, 61 (04)
  • [24] Multi-Hop Reasoning for Question Answering with Knowledge Graph
    Zhang, Jiayuan
    Cai, Yifei
    Zhang, Qian
    Cao, Zehao
    Cheng, Zhenrong
    Li, Dongmei
    Meng, Xianghao
    2021 IEEE/ACIS 20TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS 2021-SUMMER), 2021, : 121 - 125
  • [25] VQA-BC: Robust Visual Question Answering via Bidirectional Chaining
    Lao, Mingrui
    Guo, Yanming
    Chen, Wei
    Pu, Nan
    Lew, Michael S.
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4833 - 4837
  • [26] Medical Visual Question Answering via Conditional Reasoning
    Zhan, Li-Ming
    Liu, Bo
    Fan, Lu
    Chen, Jiaxin
    Wu, Xiao-Ming
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 2345 - 2354
  • [27] Interpretable Visual Question Answering via Reasoning Supervision
    Parelli, Maria
    Mallis, Dimitrios
    Diomataris, Markos
    Pitsikalis, Vassilis
    2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2023, : 2525 - 2529
  • [28] Visual Question Answering Research on Joint Knowledge and Visual Information Reasoning
    Su, Zhenqiang
    Gou, Gang
    COMPUTER ENGINEERING AND APPLICATIONS, 2024, 60 (05) : 95 - 102
  • [29] Question Answering by Reasoning Across Documents with Graph Convolutional Networks
    De Cao, Nicola
    Aziz, Wilker
    Titov, Ivan
    2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 2306 - 2317
  • [30] Multimodal feature fusion by relational reasoning and attention for visual question answering
    Zhang, Weifeng
    Yu, Jing
    Hu, Hua
    Hu, Haiyang
    Qin, Zengchang
    INFORMATION FUSION, 2020, 55 : 116 - 126