Integrating multimodal features by a two-way co-attention mechanism for visual question answering

Cited by: 0
Authors
Sharma, Himanshu [1]
Srivastava, Swati [1]
Affiliations
[1] GLA Univ Mathura, Dept Comp Engn & Applicat, Mathura, India
Keywords
VQA; Attention; Co-attention; Multimodal; Relational reasoning
DOI
10.1007/s11042-023-17945-8
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Existing VQA models predominantly rely on attention mechanisms that prioritize spatial dimensions, adjusting the importance of image-region or word-token features based on spatial probabilities. However, these approaches often struggle with relational reasoning: they treat objects independently and fail to fuse their features effectively, which hampers the model's ability to understand complex visual contexts and provide accurate answers. To address these limitations, we introduce a novel co-attention mechanism into the VQA model. This mechanism enhances Faster R-CNN's feature extraction by emphasizing image regions relevant to the posed question, which in turn improves the model's capacity for visual-relationship reasoning and makes it more adept at analyzing complex visual contexts. Additionally, our model incorporates feature-wise multimodal two-way co-attention, enabling seamless integration of image and question representations and yielding more precise answer predictions. Our model achieves strong results on VQA 1.0, surpassing the best existing model, the Re-attention model, by 1.14% on test-std. Moreover, on VQA 2.0 it outperforms the best model, the IAHOT model, by a margin of 2.98% on test-std. These findings show that our approach not only outperforms earlier models but also establishes a new state of the art in Visual Question Answering.
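The abstract describes two-way co-attention only at a high level, and this record includes no equations. The following PyTorch sketch therefore just illustrates the general idea: question-guided attention over Faster R-CNN region features, image-guided attention over word features, and a simple fusion of the two attended vectors. The class name, dimensions, max-pooled affinity scoring, and element-wise-product fusion are all assumptions chosen for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoWayCoAttention(nn.Module):
    """Minimal sketch of a two-way (image <-> question) co-attention block.

    Hypothetical structure: the paper's actual architecture may differ.
    """

    def __init__(self, img_dim: int, qst_dim: int, hid_dim: int):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hid_dim)  # project region features
        self.qst_proj = nn.Linear(qst_dim, hid_dim)  # project word features

    def forward(self, img_feats: torch.Tensor, qst_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, R, img_dim) -- R region features, e.g. from Faster R-CNN
        # qst_feats: (B, T, qst_dim) -- T word-token features
        v = self.img_proj(img_feats)                # (B, R, H)
        q = self.qst_proj(qst_feats)                # (B, T, H)

        # Affinity between every image region and every question word.
        affinity = torch.bmm(v, q.transpose(1, 2))  # (B, R, T)

        # Question-guided attention over regions (one direction) ...
        img_attn = F.softmax(affinity.max(dim=2).values, dim=1)  # (B, R)
        # ... and image-guided attention over words (the other direction).
        qst_attn = F.softmax(affinity.max(dim=1).values, dim=1)  # (B, T)

        v_att = torch.bmm(img_attn.unsqueeze(1), v).squeeze(1)   # (B, H)
        q_att = torch.bmm(qst_attn.unsqueeze(1), q).squeeze(1)   # (B, H)

        # Element-wise product is one common (assumed) multimodal fusion choice.
        return v_att * q_att


# Usage with dummy shapes: 36 regions, 14 words (illustrative values only).
coatt = TwoWayCoAttention(img_dim=2048, qst_dim=512, hid_dim=1024)
fused = coatt(torch.randn(2, 36, 2048), torch.randn(2, 14, 512))
print(fused.shape)  # torch.Size([2, 1024])
```

Deriving both attention maps from one shared affinity matrix is what makes the mechanism "two-way": each modality's weighting is conditioned on the other, rather than each being attended in isolation.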
Pages: 59577-59595 (19 pages)
Related Papers (50 in total)
  • [1] Enhancing visual question answering with a two-way co-attention mechanism and integrated multimodal features
    Agrawal, Mayank
    Jalal, Anand Singh
    Sharma, Himanshu
    [J]. COMPUTATIONAL INTELLIGENCE, 2024, 40 (01)
  • [2] Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism
    Sharma, Himanshu
    Srivastava, Swati
    [J]. IMAGING SCIENCE JOURNAL, 2021, 69 (1-4) : 177 - 189
  • [3] Multimodal feature-wise co-attention method for visual question answering
    Zhang, Sheng
    Chen, Min
    Chen, Jincai
    Zou, Fuhao
    Li, Yuan-Fang
    Lu, Ping
    [J]. INFORMATION FUSION, 2021, 73 : 1 - 10
  • [4] Co-Attention Network With Question Type for Visual Question Answering
    Yang, Chao
    Jiang, Mengqi
    Jiang, Bin
    Zhou, Weixin
    Li, Keqin
    [J]. IEEE ACCESS, 2019, 7 : 40771 - 40781
  • [5] Dynamic Co-attention Network for Visual Question Answering
    Ebaid, Doaa B.
    Madbouly, Magda M.
    El-Zoghabi, Adel A.
    [J]. 2021 8TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING & MACHINE INTELLIGENCE (ISCMI 2021), 2021, : 125 - 129
  • [6] Hierarchical Question-Image Co-Attention for Visual Question Answering
    Lu, Jiasen
    Yang, Jianwei
    Batra, Dhruv
    Parikh, Devi
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS 2016), 2016, 29
  • [7] Co-attention Network for Visual Question Answering Based on Dual Attention
    Dong, Feng
    Wang, Xiaofeng
    Oad, Ammar
    Talpur, Mir Sajjad Hussain
    [J]. JOURNAL OF ENGINEERING SCIENCE AND TECHNOLOGY REVIEW, 2021, 14 (06) : 116 - 123
  • [8] Deep Modular Co-Attention Networks for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Cui, Yuhao
    Tao, Dacheng
    Tian, Qi
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6274 - 6283
  • [9] An Effective Dense Co-Attention Networks for Visual Question Answering
    He, Shirong
    Han, Dezhi
    [J]. SENSORS, 2020, 20 (17) : 1 - 15
  • [10] Co-attention graph convolutional network for visual question answering
    Liu, Chuan
    Tan, Ying-Ying
    Xia, Tian-Tian
    Zhang, Jiajing
    Zhu, Ming
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2527 - 2543