Local self-attention in transformer for visual question answering

Cited by: 0
Authors
Xiang Shen
Dezhi Han
Zihan Guo
Chongqing Chen
Jie Hua
Gaofeng Luo
Affiliations
[1] Shanghai Maritime University, College of Information Engineering
[2] University of Technology, TD School
[3] Shaoyang University, College of Information Engineering
Source
Applied Intelligence | 2023, Volume 53
Keywords
Transformer; Local self-attention; Grid/regional visual features; Visual question answering
DOI
Not available
Abstract
Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Many VQA models adopt the Transformer because of its strength in modeling global dependencies through self-attention. However, balancing global and local dependency modeling in the standard Transformer remains an open issue: a Transformer-based VQA model that only captures global dependencies cannot effectively exploit image context information. This paper therefore proposes Local Self-Attention in Transformer (LSAT), a novel VQA model that addresses this issue. LSAT sets local windows over the visual features and jointly models intra-window and inter-window attention, capturing rich contextual information while avoiding the redundancy of purely global self-attention. Extensive experiments and ablation studies with grid visual features are conducted on the VQA benchmark datasets VQA 2.0 and CLEVR. The results show that, with an appropriate local window size, LSAT outperforms the baseline models on all metrics; its best test accuracies with grid visual features reach 71.94% on VQA 2.0 and 98.72% on CLEVR. Source code is available at https://github.com/shenxiang-vqa/LSAT.
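To make the windowed-attention idea in the abstract concrete, the sketch below shows a generic intra-window plus inter-window self-attention over a grid of visual features in PyTorch. It is a minimal illustration only: the class name LocalWindowSelfAttention, the mean-pooled window summaries, the residual broadcast, and all hyperparameters are assumptions for exposition, not the authors' actual LSAT implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class LocalWindowSelfAttention(nn.Module):
    """Illustrative local self-attention over grid visual features.

    Splits an H x W feature grid into non-overlapping windows, runs
    self-attention inside each window (intra-window), then lets pooled
    window summaries attend to each other (inter-window). Hypothetical
    sketch, not the LSAT paper's implementation.
    """

    def __init__(self, dim: int, window_size: int = 4, num_heads: int = 8):
        super().__init__()
        self.window_size = window_size
        self.intra_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.inter_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) grid visual features; H and W divisible by window_size.
        B, H, W, C = x.shape
        s = self.window_size
        # Partition the grid into (H/s * W/s) windows of s*s tokens each.
        win = (x.reshape(B, H // s, s, W // s, s, C)
                .permute(0, 1, 3, 2, 4, 5)
                .reshape(B * (H // s) * (W // s), s * s, C))
        # Intra-window attention: tokens attend only within their own window.
        win, _ = self.intra_attn(win, win, win)
        # Summarize each window (mean pooling) for inter-window attention.
        summary = win.mean(dim=1).reshape(B, (H // s) * (W // s), C)
        # Inter-window attention: window summaries exchange cross-window context.
        summary, _ = self.inter_attn(summary, summary, summary)
        # Broadcast each refined summary back to the tokens of its window.
        win = win + summary.reshape(-1, 1, C)
        # Restore the original (B, H, W, C) grid layout.
        return (win.reshape(B, H // s, W // s, s, s, C)
                   .permute(0, 1, 3, 2, 4, 5)
                   .reshape(B, H, W, C))

# Example: a 16x16 grid of 512-d features split into 4x4 local windows.
feats = torch.randn(2, 16, 16, 512)
out = LocalWindowSelfAttention(dim=512)(feats)
print(out.shape)  # torch.Size([2, 16, 16, 512])
```

Restricting token-to-token attention to each window keeps the quadratic cost local, while the summary-level attention recovers cross-window context, which is the general trade-off the abstract attributes to LSAT.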
Pages: 16706-16723
Number of pages: 17
Related papers
50 records in total
  • [41] Dynamic Capsule Attention for Visual Question Answering
    Zhou, Yiyi
    Ji, Rongrong
    Su, Jinsong
    Sun, Xiaoshuai
    Chen, Weiqiu
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 9324 - 9331
  • [42] How Self-Attention Improves Rare Class Performance in a Question-Answering Dialogue Agent
    Stiff, Adam
    Song, Qi
    Fosler-Lussier, Eric
    SIGDIAL 2020: 21ST ANNUAL MEETING OF THE SPECIAL INTEREST GROUP ON DISCOURSE AND DIALOGUE (SIGDIAL 2020), 2020, : 196 - 202
  • [43] Generative Attention Model with Adversarial Self-learning for Visual Question Answering
    Ilievski, Ilija
    Feng, Jiashi
    PROCEEDINGS OF THE THEMATIC WORKSHOPS OF ACM MULTIMEDIA 2017 (THEMATIC WORKSHOPS'17), 2017, : 415 - 423
  • [44] RVT-Transformer: Residual Attention in Answerability Prediction on Visual Question Answering for Blind People
    Duy-Minh Nguyen-Tran
    Tung Le
    Khoa Pho
    Minh Le Nguyen
    Huy Tien Nguyen
    ADVANCES IN COMPUTATIONAL COLLECTIVE INTELLIGENCE, ICCCI 2022, 2022, 1653 : 423 - 435
  • [45] Advancing Vietnamese Visual Question Answering with Transformer and Convolutional
    Nguyen, Ngoc Son
    Nguyen, Van Son
    Le, Tung
    COMPUTERS & ELECTRICAL ENGINEERING, 2024, 119
  • [46] Focal Visual-Text Attention for Visual Question Answering
    Liang, Junwei
    Jiang, Lu
    Cao, Liangliang
    Li, Li-Jia
    Hauptmann, Alexander
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6135 - 6143
  • [47] Light-Weight Vision Transformer with Parallel Local and Global Self-Attention
    Ebert, Nikolas
    Reichardt, Laurenz
    Stricker, Didier
    Wasenmueller, Oliver
    2023 IEEE 26TH INTERNATIONAL CONFERENCE ON INTELLIGENT TRANSPORTATION SYSTEMS, ITSC, 2023, : 452 - 459
  • [48] PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention
    Ebert, Nikolas
    Stricker, Didier
    Wasenmueller, Oliver
    SENSORS, 2023, 23 (07)
  • [49] Local-Global Self-Attention for Transformer-Based Object Tracking
    Chen, Langkun
    Gao, Long
    Jiang, Yan
    Li, Yunsong
    He, Gang
    Ning, Jifeng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (12) : 12316 - 12329
  • [50] Universal Graph Transformer Self-Attention Networks
    Dai Quoc Nguyen
    Tu Dinh Nguyen
    Dinh Phung
    COMPANION PROCEEDINGS OF THE WEB CONFERENCE 2022, WWW 2022 COMPANION, 2022, : 193 - 196