Local self-attention in transformer for visual question answering

Citations: 0
Authors
Xiang Shen
Dezhi Han
Zihan Guo
Chongqing Chen
Jie Hua
Gaofeng Luo
Affiliations
[1] Shanghai Maritime University, College of Information Engineering
[2] University of Technology, TD School
[3] Shaoyang University, College of Information Engineering
Source
Applied Intelligence | 2023, Vol. 53
Keywords
Transformer; Local self-attention; Grid/regional visual features; Visual question answering
DOI
Not available
Abstract
Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Many VQA models adopt the Transformer architecture for its strength in modeling global dependencies through self-attention. However, balancing global and local dependency modeling in the traditional Transformer remains an open issue: a Transformer-based VQA model that captures only global dependencies cannot effectively exploit image context. This paper therefore proposes a novel Local Self-Attention in Transformer (LSAT) model for visual question answering. By defining local windows over the visual features, LSAT jointly models intra-window and inter-window attention, capturing rich contextual information while avoiding the redundancy of purely global self-attention. Extensive experiments and ablation studies with grid visual features are conducted on the VQA benchmark datasets VQA 2.0 and CLEVR. The results show that, with an appropriate local window size, LSAT outperforms the baseline models on all metrics; its best test accuracies with grid visual features reach 71.94% on VQA 2.0 and 98.72% on CLEVR. Source code is available at https://github.com/shenxiang-vqa/LSAT.
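As a rough illustration of the windowed attention the abstract describes, the following is a minimal PyTorch sketch of self-attention over grid visual features with local windows: tokens first attend within their window (intra-window), then pooled window summaries attend to each other (inter-window). The module name LocalWindowAttention, the tensor shapes, and the mean-pooling step for inter-window attention are illustrative assumptions, not the published LSAT implementation; see the linked repository for the authors' code.

```python
# A minimal sketch of windowed self-attention over grid visual features.
# Shapes, module names, and the pooling-based inter-window step are
# illustrative assumptions, not the authors' exact LSAT implementation.
import torch
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    def __init__(self, dim, window_size, num_heads=8):
        super().__init__()
        self.ws = window_size
        self.intra = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) grid visual features; H and W must be
        # divisible by the window size in this simplified sketch.
        B, H, W, C = x.shape
        ws = self.ws
        nh, nw = H // ws, W // ws
        # Partition into non-overlapping ws x ws windows:
        # (B, nh, ws, nw, ws, C) -> (B*nh*nw, ws*ws, C)
        win = (x.view(B, nh, ws, nw, ws, C)
                .permute(0, 1, 3, 2, 4, 5)
                .reshape(B * nh * nw, ws * ws, C))
        # Intra-window attention: tokens attend only within their window.
        win, _ = self.intra(win, win, win)
        # Summarize each window by mean pooling, then let the window
        # summaries attend to each other (inter-window attention).
        summary = win.mean(dim=1).view(B, nh * nw, C)
        context, _ = self.inter(summary, summary, summary)
        # Broadcast each window's context back onto its tokens.
        win = win + context.view(B * nh * nw, 1, C)
        # Undo the window partition back to (B, H, W, C).
        out = (win.view(B, nh, nw, ws, ws, C)
                  .permute(0, 1, 3, 2, 4, 5)
                  .reshape(B, H, W, C))
        return out

# Example: a 14x14 grid of 512-d features with 7x7 local windows.
attn = LocalWindowAttention(dim=512, window_size=7)
feats = torch.randn(2, 14, 14, 512)
print(attn(feats).shape)  # torch.Size([2, 14, 14, 512])
```

Restricting attention to windows reduces the pairwise-comparison cost from all H*W tokens to ws*ws tokens per window, which is the trade-off between local context and global coverage that the paper's window-size ablation explores.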
Pages: 16706-16723
Page count: 17
Related Papers
50 records
  • [1] Local self-attention in transformer for visual question answering
    Shen, Xiang
    Han, Dezhi
    Guo, Zihan
    Chen, Chongqing
    Hua, Jie
    Luo, Gaofeng
    APPLIED INTELLIGENCE, 2023, 53 (13) : 16706 - 16723
  • [2] Stacked Self-Attention Networks for Visual Question Answering
    Sun, Qiang
    Fu, Yanwei
    ICMR'19: PROCEEDINGS OF THE 2019 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2019, : 207 - 211
  • [3] ASAM: Asynchronous Self-Attention Model for Visual Question Answering
    Liu, Han
    Han, Dezhi
    Zhang, Shukai
    Shi, Jingya
    Wu, Huafeng
    Zhou, Yachao
    Li, Kuan-Ching
    COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2025, 22 (01)
  • [4] Dual self-attention with co-attention networks for visual question answering
    Liu, Yun
    Zhang, Xiaoming
    Zhang, Qianyun
    Li, Chaozhuo
    Huang, Feiran
    Tang, Xianghong
    Li, Zhoujun
PATTERN RECOGNITION, 2021, 117
  • [5] Intra-Modality Feature Interaction Using Self-attention for Visual Question Answering
    Shao, Huan
    Xu, Yunlong
    Ji, Yi
    Yang, Jianyu
    Liu, Chunping
    NEURAL INFORMATION PROCESSING, ICONIP 2019, PT V, 2019, 1143 : 215 - 222
  • [6] A novel self-attention enriching mechanism for biomedical question answering
    Kaddari, Zakaria
    Bouchentouf, Toumi
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 225
  • [7] TRAR: Routing the Attention Spans in Transformer for Visual Question Answering
    Zhou, Yiyi
    Ren, Tianhe
    Zhu, Chaoyang
    Sun, Xiaoshuai
    Liu, Jianzhuang
    Ding, Xinghao
    Xu, Mingliang
    Ji, Rongrong
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2054 - 2064
  • [8] Multi-page Document Visual Question Answering Using Self-attention Scoring Mechanism
    Kang, Lei
    Tito, Ruben
    Valveny, Ernest
    Karatzas, Dimosthenis
    DOCUMENT ANALYSIS AND RECOGNITION-ICDAR 2024, PT VI, 2024, 14809 : 219 - 232
  • [9] SAFFNet: self-attention based on Fourier frequency domain filter network for visual question answering
    Shi, Jingya
    Han, Dezhi
    Chen, Chongqing
    Shen, Xiang
VISUAL COMPUTER, 2025
  • [10] Transformer Gate Attention Model: An Improved Attention Model for Visual Question Answering
    Zhang, Haotian
    Wu, Wei
2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022