Deep Modular Bilinear Attention Network for Visual Question Answering

Times Cited: 0
Authors
Yan, Feng [1 ]
Silamu, Wushouer [1 ,2 ]
Li, Yanbing [1 ]
Affiliations
[1] Xinjiang Univ, Sch Informat Sci & Engn, Urumqi 830046, Peoples R China
[2] Xinjiang Univ, Lab Multilingual Informat Technol, Urumqi 830046, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
attention mechanism; visual question answering; multimodal; bilinear attention network
DOI
10.3390/s22031045
Chinese Library Classification
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
VQA (Visual Question Answering) is a multimodal task: given an image and a question related to it, a model must determine the correct answer. The attention mechanism has become a de facto component of almost all VQA models, and most recent approaches use the dot product to compute intra-modality and inter-modality attention between visual and language features. In this paper, we instead use the BAN (Bilinear Attention Network) method to compute attention. We propose a deep multimodal bilinear attention network (DMBA-NET) framework with two basic attention units, BAN-GA and BAN-SA, that construct inter-modality and intra-modality relations. These two units are the core of the framework and can be cascaded in depth. In addition, we encode the question with the dynamic word vectors of BERT (Bidirectional Encoder Representations from Transformers) and process the question features further with self-attention. We then sum these features with those obtained from BAN-GA and BAN-SA before the final classification. Without using the Visual Genome dataset for augmentation, our model reaches 70.85% accuracy on the test-std split of VQA 2.0.
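A minimal PyTorch sketch of one bilinear attention unit in the spirit of BAN (Kim et al.), the building block behind the BAN-GA and BAN-SA units named in the abstract. The class name, dimensions, and glimpse count are illustrative assumptions, not the authors' exact DMBA-NET implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttention(nn.Module):
    # Low-rank bilinear attention between two feature sets (a sketch, not DMBA-NET itself).
    def __init__(self, x_dim, y_dim, hidden_dim, glimpses=1):
        super().__init__()
        self.U = nn.Linear(x_dim, hidden_dim)     # projects modality X (e.g., image regions)
        self.V = nn.Linear(y_dim, hidden_dim)     # projects modality Y (e.g., question tokens)
        self.p = nn.Linear(hidden_dim, glimpses)  # low-rank bilinear weights, one per glimpse

    def forward(self, x, y):
        # x: (batch, n_x, x_dim); y: (batch, n_y, y_dim)
        hx = torch.relu(self.U(x))                # (batch, n_x, hidden)
        hy = torch.relu(self.V(y))                # (batch, n_y, hidden)
        # Pairwise Hadamard interactions yield bilinear attention logits over all (i, j) pairs.
        logits = self.p(hx.unsqueeze(2) * hy.unsqueeze(1))      # (batch, n_x, n_y, glimpses)
        att = F.softmax(logits.flatten(1, 2), dim=1).view_as(logits)
        # Joint feature: attention-weighted sum of elementwise products of the two modalities.
        joint = torch.einsum('bijg,bih,bjh->bgh', att, hx, hy)  # (batch, glimpses, hidden)
        return joint, att

# Toy usage with assumed sizes: 36 region features (2048-d) attended against
# 14 BERT token embeddings (768-d).
unit = BilinearAttention(x_dim=2048, y_dim=768, hidden_dim=512, glimpses=2)
joint, att = unit(torch.randn(8, 36, 2048), torch.randn(8, 14, 768))

For intra-modality attention (BAN-SA), the same unit can attend a feature set to itself by passing identical inputs for x and y, while inter-modality attention (BAN-GA) attends visual features to question features; per the abstract, the resulting features are summed before the final classification.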
Pages: 15