A Study of Visual Question Answering Techniques Based on Collaborative Multi-Head Attention

Cited by: 1
Authors
Yang, Yingli [1 ]
Jin, Jingxuan [1 ]
Li, De [2 ]
Affiliations
[1] Yanbian Univ, Inst Intelligent Informat Proc, Yanji, Peoples R China
[2] Yanbian Univ, Dept Computer Sci & Technol, Yanji, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
visual question answering; pre-training; collaborative multi-head attention; Swin Transformer;
DOI
10.1109/ACCTCS58815.2023.00037
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
In the visual question answering task, the dominant recent approach has been to pre-train a unified model and then fine-tune it; such unified models typically use a transformer to fuse image and text information. To improve performance on visual question answering, this paper proposes a transformer architecture based on a collaborative multi-head attention mechanism, which addresses the key/value projection redundancy in the transformer's multi-head attention. In addition, this paper uses the Swin Transformer as the image feature extractor to obtain multi-scale image information. Validation experiments on the VQA v2 dataset show that applying the collaborative multi-head attention approach and the Swin Transformer backbone to the visual question answering model effectively improves the accuracy of the visual question answering task.
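
As background for the abstract's central mechanism: in collaborative multi-head attention (Cordonnier et al., 2020), the heads share a single query/key projection pair and are differentiated by small learned mixing vectors, which removes the redundancy among per-head projections. The following PyTorch sketch illustrates that shared-projection formulation under stated assumptions; the class name CollabAttention, the d_shared size, and all dimension choices are illustrative and not the authors' implementation, and the paper's exact variant (the abstract speaks of key/value redundancy) may differ in detail.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollabAttention(nn.Module):
    """Collaborative multi-head self-attention sketch: heads share one
    query/key projection and are distinguished by learned mixing vectors."""
    def __init__(self, d_model: int, n_heads: int, d_shared: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        # One shared projection pair replaces the per-head W_Q^i / W_K^i,
        # removing redundancy among the head projections.
        self.w_q = nn.Linear(d_model, d_shared, bias=False)
        self.w_k = nn.Linear(d_model, d_shared, bias=False)
        # Head i re-weights the shared dimensions with its mixing vector m_i.
        self.mix = nn.Parameter(torch.ones(n_heads, d_shared))
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q = self.w_q(x)  # (B, T, d_shared), shared by all heads
        k = self.w_k(x)  # (B, T, d_shared), shared by all heads
        v = self.w_v(x).view(B, T, self.n_heads, -1).transpose(1, 2)
        # Per-head queries: element-wise mix of the shared query features.
        q_h = q.unsqueeze(1) * self.mix.view(1, self.n_heads, 1, -1)
        scores = q_h @ k.unsqueeze(1).transpose(-2, -1) / math.sqrt(q.size(-1))
        attn = F.softmax(scores, dim=-1)  # (B, n_heads, T, T)
        out = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.w_o(out)

Example usage with hypothetical sizes for image-patch tokens:

layer = CollabAttention(d_model=512, n_heads=8, d_shared=256)
y = layer(torch.randn(2, 49, 512))  # batch of 2, 49 tokens, 512-d features

Choosing d_shared below n_heads times the usual per-head key dimension is what yields the parameter savings this family of mechanisms targets, since the mixing vectors are far cheaper than separate per-head projection matrices.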
Pages: 552-555
Page count: 4