IMCN: Improved modular co-attention networks for visual question answering

Cited by: 1
Authors
Liu, Cheng [1 ,2 ]
Wang, Chao [1 ,2 ]
Peng, Yan [1 ,2 ,3 ]
Affiliations
[1] Shanghai Univ, Sch Future Technol, Shanghai, Peoples R China
[2] Shanghai Univ, Inst Artificial Intelligence, Shanghai, Peoples R China
[3] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
Funding
Natural Science Foundation of Shanghai;
Keywords
Co-attention; Multimodal; Self-attention; Visual question answering;
DOI
10.1007/s10489-024-05456-4
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many existing Visual Question Answering (VQA) methods use traditional attention mechanisms to focus on each region of the input image and each word of the input question, and they achieve good performance. However, the most obvious limitation of traditional attention mechanisms is that the module always produces a weighted average conditioned on a specific query. When no region or word is relevant to the query, the generated vectors carry noisy information and may lead to incorrect predictions. In this paper, we propose an Improved Modular Co-attention Network (IMCN) that incorporates the Attention on Attention (AoA) module into both the self-attention and co-attention modules to address this problem. AoA adds a second attention step: an information vector and an attention gate are both generated from the attention result and the current context, and their element-wise product forms the output. With AoA, the attended information obtained by the model is more useful. We also introduce an Improved Multimodal Fusion Network (IMFN), which uses multiple branches to perform hierarchical fusion of visual and textual features for further improvement. We conduct extensive experiments on the VQA-v2 dataset to verify the effectiveness of the proposed modules, and the results demonstrate that our model outperforms existing methods.
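The abstract describes the AoA gating only at a high level. The PyTorch sketch below illustrates that generic mechanism under stated assumptions: the class names, the 512-dimensional hidden size, and the use of torch.nn.MultiheadAttention are illustrative choices, not details taken from the paper.

    import torch
    import torch.nn as nn

    class AoA(nn.Module):
        # Attention-on-Attention gate: from the attention result and the current
        # context (the query), build an information vector and a sigmoid attention
        # gate, then combine them by element-wise multiplication.
        def __init__(self, dim):
            super().__init__()
            self.info = nn.Linear(2 * dim, dim)   # information vector
            self.gate = nn.Linear(2 * dim, dim)   # attention gate

        def forward(self, query, att):
            x = torch.cat([att, query], dim=-1)
            i = self.info(x)                      # I = W_i [att; q] + b_i
            g = torch.sigmoid(self.gate(x))       # G = sigmoid(W_g [att; q] + b_g)
            return g * i                          # gated attended information

    class AoASelfAttention(nn.Module):
        # A self-attention layer wrapped with AoA, in the spirit of the abstract;
        # hyperparameters here are assumptions for illustration only.
        def __init__(self, dim=512, heads=8):
            super().__init__()
            self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.aoa = AoA(dim)

        def forward(self, x):
            att, _ = self.mha(x, x, x)            # standard scaled dot-product attention
            return self.aoa(x, att)               # second attention step on the result

    # Usage example: a batch of 2 sequences of 14 word features, dimension 512.
    feats = torch.randn(2, 14, 512)
    out = AoASelfAttention()(feats)               # shape (2, 14, 512)

The gate lets the layer suppress an attention result when nothing in the input actually matches the query, which is the failure case of plain weighted averaging that the abstract highlights.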
Pages: 5167-5182
Number of pages: 16