Deep Modular Bilinear Attention Network for Visual Question Answering

Cited: 0
Authors
Yan, Feng [1 ]
Silamu, Wushouer [1 ,2 ]
Li, Yanbing [1 ]
Affiliations
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi 830046, Peoples R China
[2] Xinjiang Univ, Lab Multilingual Informat Technol, Urumqi 830046, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
attention mechanism; visual question answering; multi-modal; bilinear attention network
DOI
10.3390/s22031045
CLC Number
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
VQA (Visual Question Answering) is a multimodal task: given an image and a question about it, the goal is to determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models. Most recent VQA approaches use dot-product attention to compute the intra-modality and inter-modality attention between visual and language features. In this paper, we instead use the BAN (Bilinear Attention Network) method to compute attention. We propose a deep multi-modality bilinear attention network (DMBA-NET) framework with two basic attention units, BAN-GA and BAN-SA, that construct inter-modality and intra-modality relations. These two units are the core of the whole network framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and then process the question features further with self-attention. We sum the resulting features with those produced by BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, our model reaches 70.85% accuracy on the test-std split of VQA 2.0.
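To make the abstract's core building block concrete, the sketch below implements one low-rank bilinear attention unit in the spirit of BAN (Kim et al., NeurIPS 2018), which the paper uses for both its inter-modality (BAN-GA) and intra-modality (BAN-SA) units. The class name, dimension choices, and single-glimpse simplification are illustrative assumptions, not the authors' released code.

# Minimal single-glimpse bilinear attention unit (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttention(nn.Module):
    def __init__(self, x_dim, y_dim, hidden_dim):
        super().__init__()
        self.U = nn.Linear(x_dim, hidden_dim)  # projects modality X (e.g., question tokens)
        self.V = nn.Linear(y_dim, hidden_dim)  # projects modality Y (e.g., image regions)
        self.p = nn.Linear(hidden_dim, 1)      # scores each (i, j) pair

    def forward(self, x, y):
        # x: (batch, n_x, x_dim), y: (batch, n_y, y_dim)
        x_proj = self.U(x)                     # (batch, n_x, hidden)
        y_proj = self.V(y)                     # (batch, n_y, hidden)
        # Low-rank bilinear interaction: elementwise product of every (i, j) pair.
        joint = x_proj.unsqueeze(2) * y_proj.unsqueeze(1)    # (batch, n_x, n_y, hidden)
        logits = self.p(joint).squeeze(-1)                   # (batch, n_x, n_y)
        # Normalize the attention map over all (i, j) pairs, as in BAN.
        attn = F.softmax(logits.flatten(1), dim=1).view_as(logits)
        # Attended joint feature: attention-weighted sum of pairwise interactions.
        fused = (attn.unsqueeze(-1) * joint).sum(dim=(1, 2)) # (batch, hidden)
        return fused, attn

# Hypothetical usage with typical VQA feature shapes (assumed, not from the paper):
x = torch.randn(2, 14, 768)    # e.g., 14 BERT question-token vectors
y = torch.randn(2, 36, 2048)   # e.g., 36 region features from an object detector
fused, attn = BilinearAttention(768, 2048, 512)(x, y)

An intra-modality unit (BAN-SA) would feed the same modality's features as both inputs (x = y), while an inter-modality unit (BAN-GA) pairs question and image features; stacking several such units corresponds to the cascading in depth described in the abstract.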
Pages: 15