Multimodal Bi-direction Guided Attention Networks for Visual Question Answering

Authors
Cai, Linqin [1]
Xu, Nuoying [1]
Tian, Hang [1]
Chen, Kejia [2]
Fan, Haodu [1]
Affiliations
[1] Chongqing Univ Posts & Telecommun, Res Ctr Artificial Intelligence & Smart Educ, Chongqing 400065, Peoples R China
[2] Chengdu Huawei Technol Co Ltd, Chengdu 500643, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual question answering; Attention mechanism; Position attention; Deep learning; FUSION; KNOWLEDGE
DOI
10.1007/s11063-023-11403-0
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Visual question answering (VQA) has become a research hotspot at the intersection of computer vision and natural language processing. A core problem in VQA is how to fuse multimodal features from images and questions. This paper proposes a Multimodal Bi-direction Guided Attention Network (MBGAN) for VQA that combines visual relationships and attention to achieve more refined feature fusion. Specifically, self-attention is used to extract image and text features, and guided attention is applied to obtain the correlation between each image region and the related question. To capture the relative positions of different objects, position attention is further introduced to model relationship correlations and to improve the matching of multimodal features. Given an image and a natural language question, MBGAN learns visual relation inference and question attention networks in parallel to achieve fine-grained fusion of the visual and textual features; the final answer is then obtained through model stacking. MBGAN achieves 69.41% overall accuracy on the VQA-v1 dataset, 70.79% on the VQA-v2 dataset, and 68.79% on the COCO-QA dataset, outperforming most state-of-the-art models.
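The abstract gives no formulas for its attention units. As a rough illustration only, the sketch below assumes that guided attention is ordinary scaled dot-product attention in which image-region features act as queries and question-word features act as keys and values; self-attention is then the same routine with one modality in all three roles. The function names, the toy feature vectors, and the absence of learned projection matrices are assumptions made for the example and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guided_attention(queries, keys, values):
    """Scaled dot-product attention (hypothetical helper, not the paper's code).

    Each query vector attends over all key vectors; the output for a query
    is the attention-weighted sum of the value vectors.
    Returns (attended_vectors, attention_weights).
    """
    dim = len(keys[0])
    attended, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors, one output component at a time.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        weights.append(w)
        attended.append(out)
    return attended, weights


# Toy inputs (assumed): three image-region features and two question-word features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]

# Question-guided attention over image regions: regions query the words.
attended, weights = guided_attention(regions, words, words)

# Self-attention is the same routine with one modality in all three roles.
self_attended, self_weights = guided_attention(words, words, words)
```

Each row of `weights` sums to one and shows how strongly a region attends to each question word. A bi-directional variant would also call the routine with the roles swapped, so that question words query the image regions.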
Pages: 11921-11943
Page count: 23