A Multi-level Mesh Mutual Attention Model for Visual Question Answering

Cited by: 0
Authors
Zhi Lei
Guixian Zhang
Lijuan Wu
Kui Zhang
Rongjiao Liang
Affiliations
[1] Guangxi Normal University, Guangxi Key Lab of Multi
Keywords
Visual question answering; Multi-level; Mutual attention; Multi-head
Abstract
Visual question answering is a complex multimodal task involving images and text, with broad application prospects in human–computer interaction and medical assistance. A central issue is therefore how to model the interaction between critical image regions and question keywords, and how to fuse the resulting multimodal features. To this end, we propose a neural network based on the encoder–decoder structure of the Transformer. In the encoder, we use multi-head self-attention to mine word–word relations within the question features and stack multiple attention layers to obtain multi-level question features. On the decoder side, we propose a mutual attention module that exchanges information between the modalities to obtain better representations of question and image features. In addition, we connect the encoder and decoder in a meshed manner, performing mutual attention with the multi-level question features and aggregating the per-level results adaptively. In the fusion stage, we propose a multi-scale fusion module that exploits feature information at different scales to complete the modal fusion. We validate the model on the VQA v1 and VQA v2 datasets, where it achieves better results than state-of-the-art methods.
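To make the mutual attention and meshed aggregation described above concrete, the following is a minimal PyTorch sketch. The class names (MutualAttention, MeshedMutualBlock), the dimensions, and the sigmoid-gated aggregation are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Exchanges information between modalities: question features attend
    over image regions, and image features attend over question words."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.q2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2q = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, q_feats, v_feats):
        q_new, _ = self.q2v(q_feats, v_feats, v_feats)  # question -> image
        v_new, _ = self.v2q(v_feats, q_feats, q_feats)  # image -> question
        return q_new, v_new

class MeshedMutualBlock(nn.Module):
    """Applies mutual attention against every encoder level and combines
    the per-level outputs with learned, input-dependent gates (an assumed
    form of the paper's adaptive aggregation)."""
    def __init__(self, d_model=512, n_heads=8, n_levels=3):
        super().__init__()
        self.mutual = nn.ModuleList(MutualAttention(d_model, n_heads)
                                    for _ in range(n_levels))
        self.gates = nn.ModuleList(nn.Linear(2 * d_model, d_model)
                                   for _ in range(n_levels))

    def forward(self, q_levels, v_feats):
        # q_levels: list of question features, one tensor per encoder level.
        out = torch.zeros_like(v_feats)
        for mutual, gate, q in zip(self.mutual, self.gates, q_levels):
            _, v_l = mutual(q, v_feats)
            alpha = torch.sigmoid(gate(torch.cat([v_feats, v_l], dim=-1)))
            out = out + alpha * v_l  # weight each level's contribution
        return out
```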
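Similarly, a hedged sketch of one plausible form of the multi-scale fusion stage: the joint features are pooled at several scales, each pooled view is projected back to the model dimension, and the sum feeds the answer classifier. The scale set (1, 2, 4) and the 3,129-answer vocabulary (a common choice for VQA v2) are assumptions.

```python
class MultiScaleFusion(nn.Module):
    """Pools joint features at several scales, projects each pooled view
    back to d_model, and sums them before classification (assumed design)."""
    def __init__(self, d_model=512, scales=(1, 2, 4), n_answers=3129):
        super().__init__()
        self.pools = nn.ModuleList(nn.AdaptiveAvgPool1d(s) for s in scales)
        self.projs = nn.ModuleList(nn.Linear(s * d_model, d_model)
                                   for s in scales)
        self.classifier = nn.Linear(d_model, n_answers)

    def forward(self, x):
        # x: (batch, seq_len, d_model) fused question/image features.
        xt = x.transpose(1, 2)              # (batch, d_model, seq_len)
        fused = 0.0
        for pool, proj in zip(self.pools, self.projs):
            z = pool(xt).flatten(1)         # (batch, scale * d_model)
            fused = fused + proj(z)
        return self.classifier(fused)       # answer logits
```

For example, MultiScaleFusion()(torch.randn(8, 36, 512)) yields a (8, 3129) tensor of answer scores.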
Pages: 339–353
Page count: 14