DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and explanation generation

Cited by: 0

Authors:

Zhang, Weifeng [1]

Yu, Jing [2]

Zhao, Wenhong [3]

Ran, Chuan [4]

Affiliations:

[1] Jiaxing University, Zhejiang, China

[2] Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China

[3] Nanhu College, Jiaxing University, Zhejiang, China

[4] IBM Corporation, NC, United States

Source:

Information Fusion | 2021 / Vol. 72

Keywords:

Artificial intelligence; Natural language processing systems; Visual languages

DOI:

Not available

Chinese Library Classification (CLC) number:

Discipline classification code:

Abstract:

Visual Question Answering (VQA), which aims to answer questions in natural language according to the content of an image, has attracted extensive attention from the artificial intelligence community. Multimodal reasoning and fusion is a central component of recent VQA models. However, most existing VQA models still fall short in reasoning over and fusing clues from multiple modalities. Furthermore, they lack interpretability because they disregard explanations. We argue that reasoning over and fusing the multiple relations implied in the modalities contributes to more accurate answers and explanations. In this paper, we design an effective multimodal reasoning and fusion model to achieve fine-grained multimodal reasoning and fusion. Specifically, we propose a Multi-Graph Reasoning and Fusion (MGRF) layer, which adopts pre-trained semantic relation embeddings, to reason about complex spatial and semantic relations between visual objects and to fuse these two kinds of relations adaptively. The MGRF layers can be further stacked in depth to form the Deep Multimodal Reasoning and Fusion Network (DMRFNet), which sufficiently reasons over and fuses multimodal relations. Furthermore, an explanation generation module is designed to justify the predicted answer. This justification reveals the rationale behind the model's decision and enhances the model's interpretability. Quantitative and qualitative experimental results on the VQA 2.0 and VQA-E datasets show DMRFNet's effectiveness. © 2021 Elsevier B.V.


Pages: 70-79
