ST-VQA: shrinkage transformer with accurate alignment for visual question answering

Cited by: 0
Authors
Haiying Xia
Richeng Lan
Haisheng Li
Shuxiang Song
Affiliations
[1] Guangxi Normal University, School of Electronic and Information Engineering
Source
Applied Intelligence | 2023 / Vol. 53
Keywords
Visual question answering; Alignment; Shrinkage transformer; Region fusion
DOI: not available
Abstract
While transformer-based models have been remarkably successful in visual question answering (VQA), their approaches to aligning vision and language features remain simple and coarse. In recent years, this shortcoming has been amplified by the popularity of vision-language pretraining, slowing the development of effective architectures for multimodal alignment. To address this, we propose the shrinkage transformer-visual question answering (ST-VQA) framework, which aims to achieve more accurate multimodal alignment than the standard transformer. First, ST-VQA uses the region features of an image as the visual representation. Second, between transformer layers, it reduces the number of visual regions by feature fusion and enforces distinctness among the new regions with a contrast loss. Finally, the visual and textual features are fused and used to predict the answer. Extensive experiments demonstrate that, without pretraining, our method outperforms the standard transformer as well as several state-of-the-art methods on the VQA-v2 dataset.
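The shrinkage-and-contrast idea in the abstract can be sketched in a minimal, illustrative form. The sketch below is an assumption, not the paper's implementation: `shrink_regions` and `contrastive_spread_loss` are hypothetical names, and greedy averaging of the most similar region pair stands in for ST-VQA's learned feature fusion between transformer layers, while a mean off-diagonal cosine similarity stands in for its contrast loss.

```python
import numpy as np

def shrink_regions(regions: np.ndarray, num_keep: int) -> np.ndarray:
    """Greedily merge the two most similar region features until only
    `num_keep` regions remain. This is an illustrative stand-in for the
    learned fusion between transformer layers described in the paper."""
    feats = [r.astype(float) for r in regions]
    while len(feats) > num_keep:
        mat = np.stack(feats)
        # cosine similarity between every pair of regions
        norm = mat / np.linalg.norm(mat, axis=1, keepdims=True)
        sim = norm @ norm.T
        np.fill_diagonal(sim, -np.inf)          # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (feats[i] + feats[j]) / 2.0    # fuse the closest pair
        feats = [f for k, f in enumerate(feats) if k not in (i, j)]
        feats.append(merged)
    return np.stack(feats)

def contrastive_spread_loss(regions: np.ndarray) -> float:
    """Mean pairwise cosine similarity of the surviving regions; a low
    value means the fused regions stay distinct, which is the property
    the paper's contrast loss encourages."""
    norm = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    sim = norm @ norm.T
    n = len(regions)
    return float(np.mean(sim[~np.eye(n, dtype=bool)]))

# Example: shrink 36 region features of dimension 8 down to 12.
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 8))
shrunk = shrink_regions(regions, 12)
```

Each merge step reduces the region count by exactly one, so the loop terminates after `36 - 12` fusions here; in the actual framework the fusion is learned and interleaved with the transformer layers rather than a fixed greedy rule.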
Pages: 20967-20978 (11 pages)