Deep Modular Co-Attention Networks for Visual Question Answering

Cited by: 508
Authors
Yu, Zhou [1 ]
Yu, Jun [1 ]
Cui, Yuhao [1 ]
Tao, Dacheng [2 ]
Tian, Qi [3 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Key Lab Complex Syst Modeling & Simulat, Hangzhou, Peoples R China
[2] Univ Sydney, FEIT, Sch Comp Sci, UBTECH Sydney AI Ctr, Sydney, NSW, Australia
[3] Huawei, Noahs Ark Lab, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China; Australian Research Council
Keywords
DOI
10.1109/CVPR.2019.00644
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the question-guided attention of images, jointly using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state-of-the-art. Our best single model delivers 70.63% overall accuracy on the test-dev set.
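The abstract describes each MCA layer as a modular composition of two basic attention units, a self-attention (SA) unit and a question-guided-attention (GA) unit, with such layers cascaded in depth. Below is a minimal PyTorch sketch of that per-layer composition, intended only to make the description concrete; the class names, the use of nn.MultiheadAttention, and the hyper-parameters (512-d features, 8 heads, 6 layers) are illustrative assumptions and do not reproduce the authors' released implementation.

```python
# Minimal sketch of one Modular Co-Attention (MCA) layer.
# Assumptions: 512-d features, 8 heads, standard multi-head attention.
import torch
import torch.nn as nn


class AttentionUnit(nn.Module):
    """Shared body of the SA and GA units: multi-head attention followed by a
    position-wise feed-forward network, each with residual + layer norm."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Dropout(dropout), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, kv, kv_mask=None):
        # x attends to kv; SA passes kv = x, GA passes kv = the guiding features.
        att, _ = self.attn(x, kv, kv, key_padding_mask=kv_mask)
        x = self.norm1(x + self.drop(att))
        return self.norm2(x + self.drop(self.ffn(x)))


class MCALayer(nn.Module):
    """One MCA layer: question self-attention, image self-attention,
    then question-guided attention over the image regions."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.sa_q = AttentionUnit(d_model, n_heads)  # SA over question words
        self.sa_v = AttentionUnit(d_model, n_heads)  # SA over image regions
        self.ga_v = AttentionUnit(d_model, n_heads)  # GA: image guided by question

    def forward(self, q, v, q_mask=None, v_mask=None):
        q = self.sa_q(q, q, kv_mask=q_mask)   # self-attention of questions
        v = self.sa_v(v, v, kv_mask=v_mask)   # self-attention of images
        v = self.ga_v(v, q, kv_mask=q_mask)   # question-guided image attention
        return q, v


if __name__ == "__main__":
    # Toy shapes: 14 question tokens, 100 image-region features, batch of 2.
    q = torch.randn(2, 14, 512)
    v = torch.randn(2, 100, 512)
    layers = nn.ModuleList([MCALayer() for _ in range(6)])  # cascade in depth
    for layer in layers:
        q, v = layer(q, v)
    print(q.shape, v.shape)  # torch.Size([2, 14, 512]) torch.Size([2, 100, 512])
```

The paper also studies different ways of cascading these layers in depth (for example, stacked versus encoder-decoder style arrangements); the sketch above fixes one plausible arrangement and shows only the per-layer structure.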
Pages: 6274-6283
Number of pages: 10
Related Papers
50 records in total
  • [1] Liu, Cheng; Wang, Chao; Peng, Yan. IMCN: Improved modular co-attention networks for visual question answering. Applied Intelligence, 2024, 54(6): 5167-5182.
  • [2] He, Shirong; Han, Dezhi. An Effective Dense Co-Attention Networks for Visual Question Answering. Sensors, 2020, 20(17): 1-15.
  • [3] Han, Dezhi; Zhou, Shuli; Li, Kuan Ching; de Mello, Rodrigo Fernandes. Cross-modality co-attention networks for visual question answering. Soft Computing, 2021, 25(7): 5411-5421.
  • [4] Liu, Yun; Zhang, Xiaoming; Zhang, Qianyun; Li, Chaozhuo; Huang, Feiran; Tang, Xianghong; Li, Zhoujun. Dual self-attention with co-attention networks for visual question answering. Pattern Recognition, 2021, 117.
  • [5] Guo, Zihan; Han, Dezhi. Sparse co-attention visual question answering networks based on thresholds. Applied Intelligence, 2023, 53(1): 586-600.
  • [6] Cui, Wencheng; Shi, Wentao; Shao, Hong. A medical visual question answering approach based on co-attention networks. Journal of Biomedical Engineering (Shengwu Yixue Gongchengxue Zazhi), 2024, 41(3): 560-568.
  • [7] Yang, Chao; Jiang, Mengqi; Jiang, Bin; Zhou, Weixin; Li, Keqin. Co-Attention Network With Question Type for Visual Question Answering. IEEE Access, 2019, 7: 40771-40781.
  • [8] Ebaid, Doaa B.; Madbouly, Magda M.; El-Zoghabi, Adel A. Dynamic Co-attention Network for Visual Question Answering. 2021 8th International Conference on Soft Computing & Machine Intelligence (ISCMI 2021), 2021: 125-129.