DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and Explanation Generation

Cited by: 0
Authors
Zhang, Weifeng [1 ]
Yu, Jing [2 ]
Zhao, Wenhong [3 ]
Ran, Chuan [4 ]
Affiliations
[1] Jiaxing University, Zhejiang, China
[2] Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
[3] Nanhu College, Jiaxing University, Zhejiang, China
[4] IBM Corporation, NC, United States
Keywords
Artificial intelligence; Natural language processing systems; Visual languages
DOI: not available
Abstract
Visual Question Answering (VQA), which aims to answer natural-language questions according to the content of an image, has attracted extensive attention from the artificial intelligence community. Multimodal reasoning and fusion is a central component of recent VQA models. However, most existing VQA models are still insufficient to reason over and fuse clues from multiple modalities. Furthermore, they lack interpretability because they disregard explanations. We argue that reasoning over and fusing the multiple relations implied across modalities contributes to more accurate answers and explanations. In this paper, we design an effective model to achieve fine-grained multimodal reasoning and fusion. Specifically, we propose the Multi-Graph Reasoning and Fusion (MGRF) layer, which adopts pre-trained semantic relation embeddings to reason over complex spatial and semantic relations between visual objects and to fuse these two kinds of relations adaptively. MGRF layers can be further stacked in depth to form the Deep Multimodal Reasoning and Fusion Network (DMRFNet), which sufficiently reasons over and fuses multimodal relations. Furthermore, an explanation generation module is designed to justify the predicted answer. This justification reveals the motive behind the model's decision and enhances the model's interpretability. Quantitative and qualitative experimental results on the VQA 2.0 and VQA-E datasets show DMRFNet's effectiveness. © 2021 Elsevier B.V.
Pages: 70-79
Related Papers (50 in total)
  • [41] Maintaining Reasoning Consistency in Compositional Visual Question Answering
    Jing, Chenchen
    Jia, Yunde
    Wu, Yuwei
    Liu, Xinyu
    Wu, Qi
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 5089 - 5098
  • [42] A DIAGNOSTIC STUDY OF VISUAL QUESTION ANSWERING WITH ANALOGICAL REASONING
    Huang, Ziqi
    Zhu, Hongyuan
    Sun, Ying
    Choi, Dongkyu
    Tan, Cheston
    Lim, Joo-Hwee
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 2463 - 2467
  • [43] A Survey on Multimodal Large Language Models in Radiology for Report Generation and Visual Question Answering
    Yi, Ziruo
    Xiao, Ting
    Albert, Mark V.
    INFORMATION, 2025, 16 (02)
  • [44] Feature Fusion Attention Visual Question Answering
    Wang, Chunlin
    Sun, Jianyong
    Chen, Xiaolin
    ICMLC 2019: 2019 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2019, : 412 - 416
  • [45] Information fusion in visual question answering: A Survey
    Zhang, Dongxiang
    Cao, Rui
    Wu, Sai
    INFORMATION FUSION, 2019, 52 : 268 - 280
  • [46] Multimodal Prompt Retrieval for Generative Visual Question Answering
    Ossowski, Timothy
    Hu, Junjie
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 2518 - 2535
  • [47] QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document
    Mahamoud, Ibrahim Souleiman
    Coustaty, Mickael
    Joseph, Aurelie
    d'Andecy, Vincent Poulain
    Ogier, Jean-Marc
    DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 659 - 673
  • [48] VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering
    Wang, Yanan
    Yasunaga, Michihiro
    Ren, Hongyu
    Wada, Shinya
    Leskovec, Jure
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 21525 - 21535
  • [49] Visual Question Answering Research on Joint Knowledge and Visual Information Reasoning
    Su, Zhenqiang
    Gou, Gang
COMPUTER ENGINEERING AND APPLICATIONS, 2024, 60 (05) : 95 - 102
  • [50] Integrating Deep Learning and Non-monotonic Logical Reasoning for Explainable Visual Question Answering
    Sridharan, Mohan
    Riley, Heather
    MULTI-AGENT SYSTEMS AND AGREEMENT TECHNOLOGIES, EUMAS 2020, AT 2020, 2020, 12520 : 558 - 570