Local self-attention in transformer for visual question answering

Citations: 0
Authors
Xiang Shen
Dezhi Han
Zihan Guo
Chongqing Chen
Jie Hua
Gaofeng Luo
Affiliations
[1] Shanghai Maritime University, College of Information Engineering
[2] University of Technology, TD School
[3] Shaoyang University, College of Information Engineering
Source
Applied Intelligence | 2023, Vol. 53
Keywords
Transformer; Local self-attention; Grid/regional visual features; Visual question answering
DOI
Not available
Abstract
Visual Question Answering (VQA) is a multimodal task that requires models to understand both textual and visual information. Many VQA models adopt the Transformer architecture for its strength in modeling global dependencies through self-attention. However, balancing global and local dependency modeling in the traditional Transformer remains an open issue: a Transformer-based VQA model that captures only global dependencies cannot effectively exploit image context. This paper therefore proposes a novel Local Self-Attention in Transformer (LSAT) model for visual question answering. By defining local windows over the visual features, LSAT jointly models intra-window and inter-window attention, capturing rich contextual information while avoiding the redundancy of purely global self-attention. Extensive experiments and ablation studies with grid visual features are conducted on the VQA benchmark datasets VQA 2.0 and CLEVR. The results show that, with an appropriate local window size, LSAT outperforms the baseline models on all metrics; its best test accuracies with grid visual features reach 71.94% on VQA 2.0 and 98.72% on CLEVR. Source code is available at https://github.com/shenxiang-vqa/LSAT.
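As a rough illustration of the windowed attention the abstract describes, the following is a minimal PyTorch sketch of self-attention over grid visual features with local windows: tokens first attend within their window (intra-window), then pooled window summaries attend to each other (inter-window). The module name LocalWindowAttention, the tensor shapes, and the mean-pooling step for inter-window attention are illustrative assumptions, not the published LSAT implementation; see the linked repository for the authors' code.

```python
# A minimal sketch of windowed self-attention over grid visual features.
# Shapes, module names, and the pooling-based inter-window step are
# illustrative assumptions, not the authors' exact LSAT implementation.
import torch
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    def __init__(self, dim, window_size, num_heads=8):
        super().__init__()
        self.ws = window_size
        self.intra = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C) grid visual features; H and W must be
        # divisible by the window size in this simplified sketch.
        B, H, W, C = x.shape
        ws = self.ws
        nh, nw = H // ws, W // ws
        # Partition into non-overlapping ws x ws windows:
        # (B, nh, ws, nw, ws, C) -> (B*nh*nw, ws*ws, C)
        win = (x.view(B, nh, ws, nw, ws, C)
                .permute(0, 1, 3, 2, 4, 5)
                .reshape(B * nh * nw, ws * ws, C))
        # Intra-window attention: tokens attend only within their window.
        win, _ = self.intra(win, win, win)
        # Summarize each window by mean pooling, then let the window
        # summaries attend to each other (inter-window attention).
        summary = win.mean(dim=1).view(B, nh * nw, C)
        context, _ = self.inter(summary, summary, summary)
        # Broadcast each window's context back onto its tokens.
        win = win + context.view(B * nh * nw, 1, C)
        # Undo the window partition back to (B, H, W, C).
        out = (win.view(B, nh, nw, ws, ws, C)
                  .permute(0, 1, 3, 2, 4, 5)
                  .reshape(B, H, W, C))
        return out

# Example: a 14x14 grid of 512-d features with 7x7 local windows.
attn = LocalWindowAttention(dim=512, window_size=7)
feats = torch.randn(2, 14, 14, 512)
print(attn(feats).shape)  # torch.Size([2, 14, 14, 512])
```

Restricting attention to windows reduces the pairwise-comparison cost from all H*W tokens to ws*ws tokens per window, which is the trade-off between local context and global coverage that the paper's window-size ablation explores.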
Pages: 16706-16723
Page count: 17
Related Papers
50 records
  • [1] Local self-attention in transformer for visual question answering
    Shen, Xiang
    Han, Dezhi
    Guo, Zihan
    Chen, Chongqing
    Hua, Jie
    Luo, Gaofeng
    APPLIED INTELLIGENCE, 2023, 53 (13) : 16706 - 16723
  • [2] Stacked Self-Attention Networks for Visual Question Answering
    Sun, Qiang
    Fu, Yanwei
    ICMR'19: PROCEEDINGS OF THE 2019 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2019, : 207 - 211
  • [3] ASAM: Asynchronous Self-Attention Model for Visual Question Answering
    Liu, Han
    Han, Dezhi
    Zhang, Shukai
    Shi, Jingya
    Wu, Huafeng
    Zhou, Yachao
    Li, Kuan-Ching
    COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2025, 22 (01)
  • [4] Dual self-attention with co-attention networks for visual question answering
    Liu, Yun
    Zhang, Xiaoming
    Zhang, Qianyun
    Li, Chaozhuo
    Huang, Feiran
    Tang, Xianghong
    Li, Zhoujun
PATTERN RECOGNITION, 2021, 117
  • [5] Intra-Modality Feature Interaction Using Self-attention for Visual Question Answering
    Shao, Huan
    Xu, Yunlong
    Ji, Yi
    Yang, Jianyu
    Liu, Chunping
    NEURAL INFORMATION PROCESSING, ICONIP 2019, PT V, 2019, 1143 : 215 - 222
  • [6] A novel self-attention enriching mechanism for biomedical question answering
    Kaddari, Zakaria
    Bouchentouf, Toumi
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 225
  • [7] TRAR: Routing the Attention Spans in Transformer for Visual Question Answering
    Zhou, Yiyi
    Ren, Tianhe
    Zhu, Chaoyang
    Sun, Xiaoshuai
    Liu, Jianzhuang
    Ding, Xinghao
    Xu, Mingliang
    Ji, Rongrong
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2054 - 2064
  • [8] Multi-page Document Visual Question Answering Using Self-attention Scoring Mechanism
    Kang, Lei
    Tito, Ruben
    Valveny, Ernest
    Karatzas, Dimosthenis
    DOCUMENT ANALYSIS AND RECOGNITION-ICDAR 2024, PT VI, 2024, 14809 : 219 - 232
  • [9] SAFFNet: self-attention based on Fourier frequency domain filter network for visual question answering
    Shi, Jingya
    Han, Dezhi
    Chen, Chongqing
    Shen, Xiang
VISUAL COMPUTER, 2025
  • [10] Transformer Gate Attention Model: An Improved Attention Model for Visual Question Answering
    Zhang, Haotian
    Wu, Wei
2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022