ST-VQA: shrinkage transformer with accurate alignment for visual question answering

Cited by: 0
Authors
Haiying Xia
Richeng Lan
Haisheng Li
Shuxiang Song
Affiliation
[1] Guangxi Normal University,School of Electronic and Information Engineering
Source
Applied Intelligence | 2023, Vol. 53
Keywords
Visual question answering; Alignment; Shrinkage transformer; Region fusion;
DOI
Not available
Abstract
While transformer-based models have been remarkably successful in visual question answering (VQA), their approaches to aligning vision and language features remain simple and coarse. In recent years, this shortcoming has been further amplified by the popularity of vision-language pretraining, slowing the development of effective architectures for multimodal alignment. Motivated by this, we propose the shrinkage transformer-visual question answering (ST-VQA) framework, which aims to achieve more accurate multimodal alignment than the standard transformer. First, the ST-VQA framework uses the region features of an image as the visual representation. Second, between transformer layers, it reduces the number of visual regions through feature fusion and enforces differences between the new regions with a contrastive loss. Finally, the visual and textual features are fused and used to predict the answer. Extensive experiments demonstrate that, without pretraining, our proposed method achieves better performance than the standard transformer and outperforms several state-of-the-art methods on the VQA-v2 dataset.
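The shrinkage step the abstract describes — reducing the number of visual regions between transformer layers by fusing features, with a contrastive term keeping the fused regions distinct — can be sketched roughly as follows. This is a minimal illustration, not the paper's actual formulation: the pairwise-fusion rule, the margin value, and the function names `shrink_regions` and `contrastive_penalty` are all assumptions made for the example.

```python
import numpy as np

def shrink_regions(regions, keep):
    """Reduce the number of visual regions by repeatedly fusing the
    most similar pair (illustrative stand-in for the paper's fusion).
    regions: (N, d) array of region features; keep: target count."""
    regions = regions.astype(float)
    while len(regions) > keep:
        # cosine similarity between every pair of regions
        norm = regions / np.linalg.norm(regions, axis=1, keepdims=True)
        sim = norm @ norm.T
        np.fill_diagonal(sim, -np.inf)          # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        fused = (regions[i] + regions[j]) / 2.0  # fuse the two closest regions
        regions = np.vstack([np.delete(regions, [i, j], axis=0), fused])
    return regions

def contrastive_penalty(regions, margin=0.5):
    """Hinge-style penalty encouraging the remaining regions to differ:
    pairs whose cosine similarity exceeds the margin are penalized."""
    norm = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    sim = norm @ norm.T
    iu = np.triu_indices(len(regions), k=1)      # each pair counted once
    return float(np.maximum(sim[iu] - margin, 0.0).sum())
```

For example, 36 detected region features could be shrunk to 9 between two layers with `shrink_regions(x, 9)`, and `contrastive_penalty` would then be added to the training loss to discourage the fused regions from collapsing onto one another.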
Pages: 20967-20978
Number of pages: 11
Related papers (50 total)
  • [21] WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering
    Chen, Pingyi
    Zhu, Chenglu
    Zheng, Sunyi
    Li, Honglin
    Yang, Lin
    COMPUTER VISION - ECCV 2024, PT XXXVI, 2025, 15094 : 401 - 417
  • [22] Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
    Goyal, Yash
    Khot, Tejas
    Agrawal, Aishwarya
    Summers-Stay, Douglas
    Batra, Dhruv
    Parikh, Devi
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2019, 127 (04) : 398 - 414
  • [23] VQA-PDF: Purifying Debiased Features for Robust Visual Question Answering Task
    Bi, Yandong
    Jiang, Huajie
    Liu, Jing
    Liu, Mengting
    Hu, Yongli
    Yin, Baocai
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT XII, ICIC 2024, 2024, 14873 : 264 - 277
  • [24] Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering
    Naik, Nandita
    Potts, Christopher
    Kreiss, Elisa
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 2813 - 2817
  • [25] Event-Oriented Visual Question Answering: The E-VQA Dataset and Benchmark
    Yang, Zhenguo
    Xiang, Jiale
    You, Jiuxiang
    Li, Qing
    Liu, Wenyin
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (10) : 10210 - 10223
  • [27] Visual-Textual Semantic Alignment Network for Visual Question Answering
    Tian, Weidong
    Zhang, Yuzheng
    He, Bin
    Zhu, Junjun
    Zhao, Zhongqiu
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2021, PT V, 2021, 12895 : 259 - 270
  • [28] Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering
    Yu, Zhou
    Jin, Zitian
    Yu, Jun
    Xu, Mingliang
    Wang, Hongbo
    Fan, Jianping
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9543 - 9556
  • [29] Local self-attention in transformer for visual question answering
    Xiang Shen
    Dezhi Han
    Zihan Guo
    Chongqing Chen
    Jie Hua
    Gaofeng Luo
    Applied Intelligence, 2023, 53 : 16706 - 16723
  • [30] RescueNet-VQA: A Large-Scale Visual Question Answering Benchmark for Damage Assessment
    Sarkar, Argho
    Rahnemoonfar, Maryam
    IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 1150 - 1153