Multimodal feature fusion by relational reasoning and attention for visual question answering

Cited by: 46
Authors
Zhang, Weifeng [1]
Yu, Jing [2]
Hu, Hua [3]
Hu, Haiyang [3]
Qin, Zengchang [4]
Affiliations
[1] Jiaxing Univ, Coll Math Phys & Informat Engn, Jiaxing, Zhejiang, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
[3] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Hangzhou, Zhejiang, Peoples R China
[4] Beihang Univ, Sch ASEE, Intelligent Comp & Machine Learning Lab, Beijing, Peoples R China
Keywords
Multimodal fusion; Visual question answering; Visual relational reasoning; Attention mechanism; INFORMATION FUSION; NETWORK
DOI
10.1016/j.inffus.2019.08.009
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Visual Question Answering (VQA) has recently become a hot topic in computer vision. A key to solving VQA lies in how to fuse the multimodal features extracted from the image and the question. In this paper, we show that combining visual relationships with attention achieves more fine-grained feature fusion. Specifically, we design an effective and efficient module to reason about complex relationships between visual objects. In addition, a bilinear attention module is learned for question-guided attention on visual objects, which allows us to obtain more discriminative visual features. Given an image and a natural-language question, our VQA model learns a visual relational reasoning network and an attention network in parallel to fuse fine-grained textual and visual features, so that answers can be predicted accurately. Experimental results show that our approach achieves new state-of-the-art single-model performance on both the VQA 1.0 and VQA 2.0 datasets.
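As a rough illustration of what question-guided bilinear attention over visual objects can look like, the sketch below scores each object feature v against a question feature q with a bilinear form v^T W q, normalizes the scores with a softmax, and returns the weighted sum of object features. This is a generic textbook formulation, not the module described in the paper; the function names, the weight matrix W, and the toy feature values are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_score(v, W, q):
    """Bilinear form v^T W q for one object feature v and question feature q."""
    return sum(
        v[i] * sum(W[i][j] * q[j] for j in range(len(q)))
        for i in range(len(v))
    )


def question_guided_attention(objects, question, W):
    """Return (attention weights, attended visual feature).

    objects:  list of n object feature vectors, each of length d_v
    question: question feature vector of length d_q
    W:        hypothetical d_v x d_q weight matrix (learned in a real model)
    """
    scores = [bilinear_score(v, W, question) for v in objects]
    weights = softmax(scores)
    dim = len(objects[0])
    attended = [
        sum(w * v[k] for w, v in zip(weights, objects)) for k in range(dim)
    ]
    return weights, attended


# Toy inputs (made up for illustration): three objects, 2-d features.
objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [1.0, -1.0]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the score reduces to v . q

weights, attended = question_guided_attention(objects, question, W)
```

With these toy values the first object aligns best with the question vector, so it receives the largest attention weight. In a trained model W would be learned jointly with the rest of the network, and the attended feature would be fused with the output of the relational reasoning branch before answer prediction.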
Pages: 116-126
Number of pages: 11
Related papers
  • [1] Visual Question Answering on CLEVR Dataset via Multimodal Fusion and Relational Reasoning
    Allahyari, Abbas
    Borna, Keivan
    [J]. 2021 52ND ANNUAL IRANIAN MATHEMATICS CONFERENCE (AIMC), 2021, : 74 - 76
  • [2] MUREL: Multimodal Relational Reasoning for Visual Question Answering
    Cadene, Remi
    Ben-younes, Hedi
    Cord, Matthieu
    Thome, Nicolas
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 1989 - 1998
  • [3] Relational reasoning and adaptive fusion for visual question answering
    Shen, Xiang
    Han, Dezhi
    Zong, Liang
    Guo, Zihan
    Hua, Jie
    [J]. APPLIED INTELLIGENCE, 2024, 54 (06) : 5062 - 5080
  • [4] Feature Fusion Attention Visual Question Answering
    Wang, Chunlin
    Sun, Jianyong
    Chen, Xiaolin
    [J]. ICMLC 2019: 2019 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2019, : 412 - 416
  • [5] Deep multimodal relational reasoning and guided attention for chart question answering
    Srivastava, Swati
    Sharma, Himanshu
    [J]. Journal of Electronic Imaging, 2024, 33 (06)
  • [6] Multimodal Attention for Visual Question Answering
    Kodra, Lorena
    Mece, Elinda Kajo
    [J]. INTELLIGENT COMPUTING, VOL 1, 2019, 858 : 783 - 792
  • [7] Multi-scale Relational Reasoning with Regional Attention for Visual Question Answering
    Ma, Yuntao
    Lu, Tong
    Wu, Yirui
    [J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 5642 - 5649
  • [8] Multimodal Learning and Reasoning for Visual Question Answering
    Ilievski, Ilija
    Feng, Jiashi
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [9] DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and explanation generation
    Zhang, Weifeng
    Yu, Jing
    Zhao, Wenhong
    Ran, Chuan
    [J]. Information Fusion, 2021, 72 : 70 - 79