Multimodal deep fusion for image question answering

Cited by: 20
Authors
Zhang, Weifeng [1 ]
Yu, Jing [2 ]
Wang, Yuxia [3 ]
Wang, Wei [3 ]
Affiliations
[1] Jiaxing Univ, Coll Math Phys & Informat Engn, Jiaxing, Zhejiang, Peoples R China
[2] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[3] Jiangnan Elect Commun Inst, Jiaxing, Zhejiang, Peoples R China
Keywords
Multimodal fusion; Image question answering; Graph neural networks; Attention
DOI
10.1016/j.knosys.2020.106639
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Multimodal fusion plays a key role in Image Question Answering (IQA). However, most current algorithms are insufficient for fusing the multiple relations implied across modalities, which are vital for predicting correct answers. In this paper, we design an effective Multimodal Deep Fusion Network (MDFNet) to achieve fine-grained multimodal fusion. Specifically, we propose a Graph Reasoning and Fusion Layer (GRFL) that reasons about complex spatial and semantic relations between visual objects and fuses these two kinds of relations adaptively. This fusion strategy allows different relations to make different contributions, guided by the reasoning step. The Multimodal Deep Fusion Network is then built by stacking several GRFLs to achieve sufficient multimodal fusion. Quantitative and qualitative experiments conducted on popular benchmarks, including VQA v2 and GQA, demonstrate the effectiveness of MDFNet. Our best single model achieves 71.19% overall accuracy on the VQA v2 dataset and 57.05% accuracy on the GQA dataset. (C) 2020 Elsevier B.V. All rights reserved.
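The GRFL described in the abstract might be sketched as follows. This is a simplified NumPy illustration under assumed details, not the paper's exact formulation: the weight matrices `W_s`, `W_m` and the gate vector `w_gate` are hypothetical stand-ins for learned parameters, and each relation branch is modeled as attention-normalized message passing over its own graph, with a per-object sigmoid gate performing the adaptive fusion.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax for normalizing relation graphs."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_reason_fuse(H, A_spatial, A_semantic, W_s, W_m, w_gate):
    """One hypothetical Graph Reasoning and Fusion Layer (GRFL) step.

    H: (n, d) visual object features.
    A_spatial, A_semantic: (n, n) spatial / semantic relation graphs.
    W_s, W_m: (d, d) per-branch projection weights (assumed learned).
    w_gate: (2d,) fusion gate weights (assumed learned).
    """
    # Reasoning: propagate features over each relation graph separately,
    # with row-wise softmax so each object attends over its neighbors.
    msg_spatial = softmax(A_spatial, axis=-1) @ H @ W_s
    msg_semantic = softmax(A_semantic, axis=-1) @ H @ W_m
    # Adaptive fusion: a per-object scalar gate decides how much each
    # kind of relation contributes to the fused representation.
    gate = 1.0 / (1.0 + np.exp(
        -(np.concatenate([msg_spatial, msg_semantic], axis=-1) @ w_gate)))
    fused = gate[:, None] * msg_spatial + (1.0 - gate)[:, None] * msg_semantic
    # Residual connection keeps the original object features.
    return H + fused
```

Stacking several such layers (feeding each layer's output into the next, as the abstract describes for MDFNet) deepens the reasoning; each layer re-weights the spatial and semantic branches for every object.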
Pages: 10
Related papers
50 records total
  • [31] Distributed Deep Learning for Question Answering
    Feng, Minwei
    Xiang, Bing
    Zhou, Bowen
    [J]. CIKM'16: PROCEEDINGS OF THE 2016 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2016, : 2413 - 2416
  • [32] An Effective Deep Transfer Learning and Information Fusion Framework for Medical Visual Question Answering
    Liu, Feifan
    Peng, Yalei
    Rosen, Max P.
    [J]. EXPERIMENTAL IR MEETS MULTILINGUALITY, MULTIMODALITY, AND INTERACTION (CLEF 2019), 2019, 11696 : 238 - 247
  • [33] Multimodal representative answer extraction in community question answering
    Li, Ming
    Ma, Yating
    Li, Ying
    Bai, Yixue
    [J]. JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2023, 35 (09)
  • [34] Adversarial Multimodal Network for Movie Story Question Answering
    Yuan, Zhaoquan
    Sun, Siyuan
    Duan, Lixin
    Li, Changsheng
    Wu, Xiao
    Xu, Changsheng
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 1744 - 1756
  • [35] Multimodal Knowledge Reasoning for Enhanced Visual Question Answering
    Hussain, Afzaal
    Maqsood, Ifrah
    Shahzad, Muhammad
    Fraz, Muhammad Moazam
    [J]. 2022 16TH INTERNATIONAL CONFERENCE ON SIGNAL-IMAGE TECHNOLOGY & INTERNET-BASED SYSTEMS, SITIS, 2022, : 224 - 230
  • [36] MUREL: Multimodal Relational Reasoning for Visual Question Answering
    Cadene, Remi
    Ben-younes, Hedi
    Cord, Matthieu
    Thome, Nicolas
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 1989 - 1998
  • [37] Health-Oriented Multimodal Food Question Answering
    Wang, Jianghai
    Hu, Menghao
    Song, Yaguang
    Yang, Xiaoshan
    [J]. MULTIMEDIA MODELING, MMM 2023, PT I, 2023, 13833 : 191 - 203
  • [38] Object-Assisted Question Featurization and Multi-CNN Image Feature Fusion for Visual Question Answering
    Manmadhan, Sruthy
    Kovoor, Binsu C.
    [J]. INTERNATIONAL JOURNAL OF INTELLIGENT INFORMATION TECHNOLOGIES, 2023, 19 (01)
  • [39] Dealing with spoken requests in a multimodal Question Answering system
    Gretter, Roberto
    Kouylekov, Milen
    Negri, Matteo
    [J]. ARTIFICIAL INTELLIGENCE: METHODOLOGY, SYSTEMS, AND APPLICATIONS, 2008, 5253 : 93 - 102
  • [40] Intelligent multimodal medical image fusion with deep guided filtering
    Rajalingam, B.
    Al-Turjman, Fadi
    Santhoshkumar, R.
    Rajesh, M.
    [J]. MULTIMEDIA SYSTEMS, 2022, 28 (04) : 1449 - 1463