Adaptive sparse triple convolutional attention for enhanced visual question answering

Cited: 0
Authors
Wang, Ronggui [1 ]
Chen, Hong [1 ]
Yang, Juan [1 ]
Xue, Lixia [1 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230601, Peoples R China
Keywords
Visual question answering; Transformer; Sparse attention; Convolutional attention
DOI
10.1007/s00371-025-03812-0
CLC number
TP31 [Computer Software]
Discipline codes
081202; 0835
Abstract
In this paper, we propose ASTCAN, an adaptive sparse triple convolutional attention network, designed to enhance visual question answering (VQA) by introducing innovative modifications to the standard Transformer architecture. Traditional VQA models often struggle with noise interference from irrelevant regions due to their inability to dynamically filter out extraneous features. ASTCAN addresses this limitation through an adaptive threshold sparse attention mechanism, which dynamically filters irrelevant features during training, significantly improving focus and efficiency. Additionally, we introduce a triple convolutional attention module, which extends the Transformer by capturing cross-dimensional interactions between spatial and channel features, further enhancing the model's reasoning ability. Extensive experiments on benchmark datasets demonstrate that ASTCAN outperforms most existing end-to-end methods, particularly in scenarios without pre-training, highlighting its effectiveness and potential for real-world applications. The code and datasets are publicly available to facilitate reproducibility and further research.
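The abstract describes the two mechanisms without implementation detail. As a minimal sketch (not the authors' code), the following PyTorch module illustrates one plausible reading of the adaptive-threshold sparse attention: a learnable scalar (here named alpha, a hypothetical parameter) sets a per-query cutoff below which attention weights are zeroed and the survivors renormalized. The triple convolutional attention module is omitted for brevity.

import torch
import torch.nn as nn

class AdaptiveSparseAttention(nn.Module):
    # Hypothetical sketch of adaptive-threshold sparse attention; the
    # thresholding rule below is an assumption, not taken from the paper.
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.alpha = nn.Parameter(torch.tensor(0.0))  # learnable sparsity level

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, H, N, N)
        attn = attn.softmax(dim=-1)
        # Adaptive threshold: keep weights above a learned fraction of the
        # per-query mean; lower weights are treated as noise and zeroed.
        thresh = self.alpha.sigmoid() * attn.mean(dim=-1, keepdim=True)
        attn = torch.where(attn >= thresh, attn, torch.zeros_like(attn))
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

Because each softmax row sums to one, the per-query mean is 1/N, so sigmoid(alpha)/N acts as a trainable cutoff: a large negative alpha keeps nearly all tokens, while a large positive alpha prunes aggressively. The maximum weight in a row is always at least the mean, so at least one token survives the filter.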
Pages: 17
Related papers
50 records in total
  • [1] Triple attention network for sentimental visual question answering
    Ruwa, Nelson
    Mao, Qirong
    Song, Heping
    Jia, Hongjie
    Dong, Ming
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2019, 189
  • [2] Adaptive attention fusion network for visual question answering
    Gu, Geonmo
    Kim, Seong Tae
    Ro, Yong Man
    2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017: 997 - 1002
  • [3] GFSNet: Gaussian Fourier with sparse attention network for visual question answering
    Shen, Xiang
    Han, Dezhi
    Chang, Chin-Chen
    Oad, Ammar
    Wu, Huafeng
    ARTIFICIAL INTELLIGENCE REVIEW, 2025, 58 (06)
  • [4] Co-attention graph convolutional network for visual question answering
    Liu, Chuan
    Tan, Ying-Ying
    Xia, Tian-Tian
    Zhang, Jiajing
    Zhu, Ming
    MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2527 - 2543
  • [5] Dual Self-Guided Attention with Sparse Question Networks for Visual Question Answering
    Shen, Xiang
    Han, Dezhi
    Chang, Chin-Chen
    Zong, Liang
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2022, E105D (04) : 785 - 796
  • [6] Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering
    Guo, Zihan
    Han, Dezhi
    SENSORS, 2020, 20 (23) : 1 - 15
  • [7] Sparse co-attention visual question answering networks based on thresholds
    Guo, Zihan
    Han, Dezhi
    APPLIED INTELLIGENCE, 2023, 53 (01) : 586 - 600
  • [8] An Improved Attention for Visual Question Answering
    Rahman, Tanzila
    Chou, Shih-Han
    Sigal, Leonid
    Carenini, Giuseppe
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021: 1653 - 1662