Multimodal deep fusion for image question answering

Cited by: 20
Authors
Zhang, Weifeng [1 ]
Yu, Jing [2 ]
Wang, Yuxia [3 ]
Wang, Wei [3 ]
Affiliations
[1] Jiaxing Univ, Coll Math Phys & Informat Engn, Jiaxing, Zhejiang, Peoples R China
[2] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[3] Jiangnan Elect Commun Inst, Jiaxing, Zhejiang, Peoples R China
Keywords
Multimodal fusion; Image question answering; Graph neural networks; Attention
DOI
10.1016/j.knosys.2020.106639
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Multimodal fusion plays a key role in Image Question Answering (IQA). However, most current algorithms fail to fuse the multiple relations implied across modalities, which are vital for predicting correct answers. In this paper, we design an effective Multimodal Deep Fusion Network (MDFNet) to achieve fine-grained multimodal fusion. Specifically, we propose a Graph Reasoning and Fusion Layer (GRFL) that reasons about the complex spatial and semantic relations between visual objects and fuses these two kinds of relations adaptively. This fusion strategy allows different relations to make different contributions, guided by the reasoning step. A Multimodal Deep Fusion Network is then built by stacking several GRFLs to achieve sufficient multimodal fusion. Quantitative and qualitative experiments on popular benchmarks, including VQA v2 and GQA, demonstrate the effectiveness of MDFNet. Our best single model achieves 71.19% overall accuracy on the VQA v2 dataset and 57.05% on the GQA dataset. (C) 2020 Elsevier B.V. All rights reserved.
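The abstract describes graph reasoning over two relation graphs between visual objects (spatial and semantic), with an adaptive, question-guided fusion of the two, stacked in depth. The PyTorch sketch below illustrates that mechanism only; every name and design choice in it (GraphReasoningFusionLayer, MDFNetSketch, the sigmoid gate, the row-normalized adjacency matrices) is an assumption for illustration, not the authors' released implementation.

```python
# Minimal sketch of a GRFL-style layer, assuming object features plus
# per-sample spatial/semantic adjacency matrices. Hypothetical, not MDFNet.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphReasoningFusionLayer(nn.Module):
    """One reasoning step over two object-level relation graphs.

    Each visual object is a node; spatial_adj and semantic_adj encode
    spatial and semantic relations. The layer propagates features over
    each graph separately, then fuses the two views with a learned,
    question-conditioned gate so that each relation type can contribute
    differently, as the abstract describes.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.spatial_proj = nn.Linear(dim, dim)   # message transform, spatial graph
        self.semantic_proj = nn.Linear(dim, dim)  # message transform, semantic graph
        self.gate = nn.Linear(2 * dim, 1)         # question-guided fusion gate

    def forward(self, nodes, spatial_adj, semantic_adj, question):
        # nodes: (B, N, D) object features; *_adj: (B, N, N); question: (B, D)
        spatial_msg = torch.bmm(self._norm(spatial_adj), self.spatial_proj(nodes))
        semantic_msg = torch.bmm(self._norm(semantic_adj), self.semantic_proj(nodes))

        # Gate each relation's contribution using the question representation.
        q = question.unsqueeze(1).expand_as(nodes)                        # (B, N, D)
        alpha = torch.sigmoid(self.gate(torch.cat([q, nodes], dim=-1)))  # (B, N, 1)
        fused = alpha * spatial_msg + (1.0 - alpha) * semantic_msg

        return F.relu(nodes + fused)  # residual update of the object features

    @staticmethod
    def _norm(adj):
        # Row-normalize adjacency so messages are averages, not sums.
        return adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)


# Stacking several such layers mirrors the paper's idea of deep fusion.
class MDFNetSketch(nn.Module):
    def __init__(self, dim: int, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            GraphReasoningFusionLayer(dim) for _ in range(num_layers)
        )

    def forward(self, nodes, spatial_adj, semantic_adj, question):
        for layer in self.layers:
            nodes = layer(nodes, spatial_adj, semantic_adj, question)
        return nodes
```

In use, the fused object representations from the last layer would feed an answer classifier; the actual relation construction, attention form, and classifier of MDFNet are specified in the paper itself (DOI above).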
Pages: 10