Deep Modular Co-Attention Networks for Visual Question Answering

Cited by: 508
Authors
Yu, Zhou [1 ]
Yu, Jun [1 ]
Cui, Yuhao [1 ]
Tao, Dacheng [2 ]
Tian, Qi [3 ]
Affiliations
[1] Hangzhou Dianzi Univ, Sch Comp Sci & Technol, Key Lab Complex Syst Modeling & Simulat, Hangzhou, Peoples R China
[2] Univ Sydney, FEIT, Sch Comp Sci, UBTECH Sydney AI Ctr, Sydney, NSW, Australia
[3] Huawei, Noahs Ark Lab, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China; Australian Research Council;
DOI
10.1109/CVPR.2019.00644
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the question-guided attention of images, jointly using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state-of-the-art. Our best single model delivers 70.63% overall accuracy on the test-dev set.
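The MCA layer described above composes two basic attention units: a self-attention (SA) unit applied to the question words and to the image regions, and a guided-attention (GA) unit in which image regions attend to the question features. The sketch below illustrates that composition with plain scaled dot-product attention in NumPy; it is a simplified illustration, not the authors' implementation — multi-head attention, feed-forward sublayers, residual connections, and layer normalization from the full SA/GA units are omitted, and all array shapes are assumed for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def mca_layer(question_feats, image_feats):
    # SA unit on question words: each word attends to all words.
    y = attention(question_feats, question_feats, question_feats)
    # SA unit on image regions: each region attends to all regions.
    x = attention(image_feats, image_feats, image_feats)
    # GA unit: image regions attend to the attended question features
    # (question-guided attention of images).
    x = attention(x, y, y)
    return y, x

# Hypothetical toy features: 14 question words, 36 image regions, 64-d each.
rng = np.random.default_rng(0)
q_feats = rng.standard_normal((14, 64))
v_feats = rng.standard_normal((36, 64))
y_out, x_out = mca_layer(q_feats, v_feats)
```

Cascading several such layers in depth, with the output of one layer feeding the next, gives the "deep" stacking the abstract refers to.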
Pages: 6274-6283 (10 pages)