Dual self-attention with co-attention networks for visual question answering

Cited by: 40
Authors
Liu, Yun [1 ,2 ]
Zhang, Xiaoming [3 ]
Zhang, Qianyun [3 ]
Li, Chaozhuo [4 ]
Huang, Feiran [5 ]
Tang, Xianghong [6 ]
Li, Zhoujun [2 ]
Affiliations
[1] Beijing Informat Sci & Technol Univ, Beijing Key Lab Internet Culture & Digital Dissem, Beijing, Peoples R China
[2] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing, Peoples R China
[3] Beihang Univ, Sch Cyber Sci & Technol, Beijing, Peoples R China
[4] Microsoft Res Asia, Beijing, Peoples R China
[5] Jinan Univ, Coll Informat Sci & Technol, Coll Cyber Secur, Guangzhou, Peoples R China
[6] Guizhou Univ, Key Lab Adv Mfg Technol, Minist Educ, Guiyang, Peoples R China
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation;
Keywords
Self-attention; Visual-textual co-attention; Visual question answering;
DOI
10.1016/j.patcog.2021.107956
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Question Answering (VQA), an important task at the intersection of vision and language understanding, has attracted wide interest. Previous VQA methods generally use Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to extract visual and textual features respectively, and then explore the correlation between these two features to infer the answer. However, CNN mainly focuses on extracting local spatial information, while RNN concentrates on exploiting sequential structure and long-range dependencies; it is difficult for them to integrate local features with their global dependencies to learn more effective representations of the image and question. To address this problem, we propose a novel model, Dual Self-Attention with Co-Attention networks (DSACA), for VQA. It models the internal dependencies of both the spatial and the sequential structure using a newly proposed self-attention mechanism. Specifically, DSACA mainly contains three sub-modules. The visual self-attention module selectively aggregates the visual features at each region by a weighted sum of the features at all positions. The textual self-attention module automatically emphasizes interdependent word features by integrating associated features among the words of the sentence. In addition, the visual-textual co-attention module explores the close correlation between the visual and textual features learned by the self-attention modules. The three modules are integrated into an end-to-end framework to infer the answer. Extensive experiments on three widely used VQA datasets confirm the favorable performance of DSACA compared with state-of-the-art methods. © 2021 Elsevier Ltd. All rights reserved.
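The architecture summarized in the abstract amounts to two intra-modal self-attention steps (over image regions and over question words) followed by a cross-modal co-attention step. The following is a minimal PyTorch sketch of that pattern, assuming standard scaled dot-product attention; the module names, feature dimensions, and the final answer classifier are illustrative assumptions and are not taken from the paper itself.

```python
# Minimal sketch, not the authors' implementation: scaled dot-product
# self-attention and cross-modal co-attention over region/word features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Rebuilds each feature as a weighted sum of the features at all positions."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                       # x: (batch, positions, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return x + attn @ v                     # residual keeps the original features

class CoAttention(nn.Module):
    """Lets question words attend to image regions (cross-modal attention)."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, text, visual):            # text: (b, T, d), visual: (b, R, d)
        q = self.query(text)
        k, v = self.key(visual), self.value(visual)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return text + attn @ v

# Toy usage: 36 region features and a 14-word question, both 512-d (assumed sizes).
visual = torch.randn(2, 36, 512)
question = torch.randn(2, 14, 512)
visual = SelfAttention(512)(visual)             # visual self-attention module
question = SelfAttention(512)(question)         # textual self-attention module
fused = CoAttention(512)(question, visual)      # visual-textual co-attention module
logits = nn.Linear(512, 3129)(fused.mean(dim=1))  # classify over an assumed answer vocabulary
```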
Pages: 13
Related Papers
50 records
  • [1] Co-attention Network for Visual Question Answering Based on Dual Attention
    Dong, Feng
    Wang, Xiaofeng
    Oad, Ammar
    Talpur, Mir Sajjad Hussain
    [J]. Journal of Engineering Science and Technology Review, 2021, 14 (06) : 116 - 123
  • [2] Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering
    Li, Xiangpeng
    Song, Jingkuan
    Gao, Lianli
    Liu, Xianglong
    Huang, Wenbing
    He, Xiangnan
    Gan, Chuang
    [J]. THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8658 - 8665
  • [3] Stacked Self-Attention Networks for Visual Question Answering
    Sun, Qiang
    Fu, Yanwei
    [J]. ICMR'19: PROCEEDINGS OF THE 2019 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2019, : 207 - 211
  • [4] Deep Modular Co-Attention Networks for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Cui, Yuhao
    Tao, Dacheng
    Tian, Qi
    [J]. 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6274 - 6283
  • [5] An Effective Dense Co-Attention Networks for Visual Question Answering
    He, Shirong
    Han, Dezhi
    [J]. SENSORS, 2020, 20 (17) : 1 - 15
  • [6] Cross-modality co-attention networks for visual question answering
    Han, Dezhi
    Zhou, Shuli
    Li, Kuan Ching
    de Mello, Rodrigo Fernandes
    [J]. SOFT COMPUTING, 2021, 25 (07) : 5411 - 5421
  • [7] Sparse co-attention visual question answering networks based on thresholds
    Guo, Zihan
    Han, Dezhi
    [J]. APPLIED INTELLIGENCE, 2023, 53 (01) : 586 - 600
  • [8] IMCN: Improved modular co-attention networks for visual question answering
    Liu, Cheng
    Wang, Chao
    Peng, Yan
    [J]. APPLIED INTELLIGENCE, 2024, 54 (06) : 5167 - 5182