ST-VQA: shrinkage transformer with accurate alignment for visual question answering

Authors
Haiying Xia
Richeng Lan
Haisheng Li
Shuxiang Song
Affiliations
[1] Guangxi Normal University, School of Electronic and Information Engineering
Source
Applied Intelligence | 2023 / Vol. 53, Issue 18
Keywords
Visual question answering; Alignment; Shrinkage transformer; Region fusion
DOI
Not available
Abstract
While transformer-based models have been remarkably successful in visual question answering (VQA), their approaches to aligning vision and language features remain simple and coarse. In recent years, the popularity of vision-language pretraining has amplified this shortcoming and slowed the development of effective architectures for multimodal alignment. To address this, we propose the shrinkage transformer-visual question answering (ST-VQA) framework, which aims to achieve more accurate multimodal alignment than the standard transformer. First, the ST-VQA framework uses the region features of an image as the visual representation. Second, between transformer layers, it reduces the number of visual regions by feature fusion and uses a contrast loss to keep the new regions distinct from one another. Finally, the visual and textual features are fused and used to predict the answer. Extensive experiments demonstrate that, without pretraining, the proposed method outperforms the standard transformer and some state-of-the-art methods on the VQA-v2 dataset.
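The region-shrinking step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion operator or the exact form of the contrast loss, so everything below is an assumption made for illustration: the helper names (`shrink_regions`, `diversity_penalty`), the use of softmax-weighted averaging against hypothetical "slot query" vectors, and mean pairwise cosine similarity as a stand-in for the loss that keeps fused regions distinct. It is not the authors' implementation.

```python
import math
import random

def softmax(values):
    # Numerically stable softmax over a list of floats.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def shrink_regions(regions, slot_queries):
    """Fuse N region vectors into M = len(slot_queries) vectors.

    Assumption: each output region is a softmax-weighted average of all
    input regions, weighted by dot-product similarity to a hypothetical
    slot query. The paper's actual fusion operator may differ.
    """
    dim = len(regions[0])
    fused = []
    for query in slot_queries:
        scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
        weights = softmax(scores)
        fused.append([sum(w * region[k] for w, region in zip(weights, regions))
                      for k in range(dim)])
    return fused

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)

def diversity_penalty(fused):
    """Mean pairwise cosine similarity between distinct fused regions.

    Assumption: a stand-in for the contrast loss; lower values mean the
    fused regions are more different from one another.
    """
    pairs = [(i, j) for i in range(len(fused)) for j in range(i + 1, len(fused))]
    return sum(cosine(fused[i], fused[j]) for i, j in pairs) / len(pairs)

random.seed(0)
regions = [[random.gauss(0, 1) for _ in range(8)] for _ in range(36)]       # 36 regions, dim 8
slot_queries = [[random.gauss(0, 1) for _ in range(8)] for _ in range(12)]  # shrink to 12
fused = shrink_regions(regions, slot_queries)
penalty = diversity_penalty(fused)
print(len(regions), "->", len(fused), "regions")
```

In a trained model the slot queries would be learned parameters and the penalty would be added to the answer-classification loss; here both are random or illustrative values only.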
Pages: 20967-20978
Page count: 11