DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and explanation generation

Cited by: 0
Authors
Zhang, Weifeng [1]
Yu, Jing [2]
Zhao, Wenhong [3]
Ran, Chuan [4]
Affiliations
[1] Jiaxing University, Zhejiang, China
[2] Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
[3] Nanhu College, Jiaxing University, Zhejiang, China
[4] IBM Corporation, NC, United States
Keywords
Artificial intelligence - Natural language processing systems - Visual languages
DOI
Not available
Abstract
Visual Question Answering (VQA), which aims to answer natural-language questions according to the content of an image, has attracted extensive attention from the artificial intelligence community. Multimodal reasoning and fusion is a central component of recent VQA models. However, most existing VQA models still cannot adequately reason over and fuse clues from multiple modalities. Furthermore, they lack interpretability because they disregard explanations. We argue that reasoning over and fusing the multiple relations implied in the different modalities contributes to more accurate answers and explanations. In this paper, we design an effective model for fine-grained multimodal reasoning and fusion. Specifically, we propose a Multi-Graph Reasoning and Fusion (MGRF) layer, which adopts pre-trained semantic relation embeddings, to reason about complex spatial and semantic relations between visual objects and to fuse these two kinds of relations adaptively. MGRF layers can be stacked in depth to form the Deep Multimodal Reasoning and Fusion Network (DMRFNet), which reasons over and fuses multimodal relations more thoroughly. In addition, an explanation generation module is designed to justify the predicted answer. This justification reveals the rationale behind the model's decision and enhances the model's interpretability. Quantitative and qualitative experimental results on the VQA 2.0 and VQA-E datasets demonstrate DMRFNet's effectiveness. © 2021 Elsevier B.V.
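The abstract gives no equations for the MGRF layer, so the following is only an illustrative sketch of the general idea it describes: message passing over two relation graphs on the same set of visual objects (one spatial, one semantic), an adaptive gate that mixes the two results, and stacking of such layers. Every name here (`aggregate`, `mgrf_like_layer`, `stack_layers`, `gate_w`) is hypothetical, mean aggregation and a scalar sigmoid gate are assumptions chosen for brevity, and the pre-trained semantic relation embeddings, question conditioning, and explanation module mentioned in the abstract are omitted.

```python
import math


def aggregate(features, adjacency):
    """Mean of neighbour features per node under one relation graph.

    A node without neighbours keeps its own features (an assumption).
    """
    out = []
    for i, row in enumerate(adjacency):
        nbrs = [j for j, a in enumerate(row) if a]
        if not nbrs:
            out.append(list(features[i]))
            continue
        dim = len(features[i])
        out.append([sum(features[j][d] for j in nbrs) / len(nbrs)
                    for d in range(dim)])
    return out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mgrf_like_layer(features, spatial_adj, semantic_adj, gate_w):
    """One hypothetical reasoning-and-fusion step (not the authors' layer).

    gate_w has length 2 * dim and scores the concatenated spatial and
    semantic messages of a node to produce a scalar gate in (0, 1).
    """
    spatial = aggregate(features, spatial_adj)
    semantic = aggregate(features, semantic_adj)
    fused = []
    for x, s, m in zip(features, spatial, semantic):
        g = sigmoid(sum(w * v for w, v in zip(gate_w, s + m)))
        # Residual update: gated mix of the two relation-specific messages.
        fused.append([xi + g * si + (1.0 - g) * mi
                      for xi, si, mi in zip(x, s, m)])
    return fused


def stack_layers(features, spatial_adj, semantic_adj, gate_ws):
    """Apply several layers in sequence, one gate vector per layer."""
    for gate_w in gate_ws:
        features = mgrf_like_layer(features, spatial_adj, semantic_adj, gate_w)
    return features


# Toy input: three detected objects with 2-dimensional features.
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
spatial_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # e.g. "next to"
semantic_adj = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]  # e.g. "holding"
one_layer = mgrf_like_layer(objects, spatial_adj, semantic_adj, [0.0] * 4)
two_layers = stack_layers(objects, spatial_adj, semantic_adj, [[0.0] * 4] * 2)
```

With an all-zero gate vector the gate equals 0.5, so each node receives an equal mix of its spatial and semantic messages; a trained gate would instead weight the two relation types per node, which is one plausible reading of "adaptively" in the abstract.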
Pages: 70-79
Related papers
50 in total
  • [21] Multimodal Attention for Visual Question Answering
    Kodra, Lorena
    Mece, Elinda Kajo
    INTELLIGENT COMPUTING, VOL 1, 2019, 858 : 783 - 792
  • [22] Joint Answering and Explanation for Visual Commonsense Reasoning
    Li, Zhenyang
    Guo, Yangyang
    Wang, Kejie
    Wei, Yinwei
    Nie, Liqiang
    Kankanhalli, Mohan
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 3836 - 3846
  • [23] Question guided multimodal receptive field reasoning network for fact-based visual question answering
    Zicheng Zuo
    Yanhan Sun
    Zhenfang Zhu
    Mei Wu
    Hui Zhao
    Multimedia Tools and Applications, 2025, 84 (12) : 11063 - 11078
  • [24] Compositional Substitutivity of Visual Reasoning for Visual Question Answering
    Li, Chuanhao
    Li, Zhen
    Jing, Chenchen
    Wu, Yuwei
    Zhai, Mingliang
    Jia, Yunde
    COMPUTER VISION - ECCV 2024, PT XLVIII, 2025, 15106 : 143 - 160
  • [25] Medical Visual Question Answering Model Based on Knowledge Enhancement and Multimodal Fusion
    Dianyuan, Zhang
    Chuanming, Yu
    Data Analysis and Knowledge Discovery, 2024, 8 (8-9) : 226 - 239
  • [26] OpenViVQA: Task, dataset, and multimodal fusion models for visual question answering in Vietnamese
    Nguyen, Nghia Hieu
    Vo, Duong T. D.
    Nguyen, Kiet Van
    Nguyen, Ngan Luu-Thuy
    INFORMATION FUSION, 2023, 100
  • [27] An Adaptive Multimodal Fusion Network Based on Multilinear Gradients for Visual Question Answering
    Zhao, Chengfang
    Tang, Mingwei
    Zheng, Yanxi
    Ran, Chaocong
    ELECTRONICS, 2025, 14 (01):
  • [28] PRIOR VISUAL RELATIONSHIP REASONING FOR VISUAL QUESTION ANSWERING
    Yang, Zhuoqian
    Qin, Zengchang
    Yu, Jing
    Wan, Tao
    2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 1411 - 1415
  • [29] Visual question answering by pattern matching and reasoning
    Zhan, Huayi
    Xiong, Peixi
    Wang, Xin
    Yang, Lan
    NEUROCOMPUTING, 2022, 467 : 323 - 336
  • [30] FROM SHALLOW TO DEEP: COMPOSITIONAL REASONING OVER GRAPHS FOR VISUAL QUESTION ANSWERING
    Zhu, Zihao
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 8217 - 8221