ST-VQA: shrinkage transformer with accurate alignment for visual question answering

Cited by: 0
Authors
Haiying Xia
Richeng Lan
Haisheng Li
Shuxiang Song
Affiliations
[1] Guangxi Normal University, School of Electronic and Information Engineering
Source
Applied Intelligence | 2023 / Volume 53
Keywords
Visual question answering; Alignment; Shrinkage transformer; Region fusion
DOI
Not available
Abstract
While transformer-based models have been remarkably successful in visual question answering (VQA), their approaches to aligning vision and language features remain simple and coarse. In recent years, this shortcoming has been further amplified by the popularity of vision-language pretraining, slowing the development of effective architectures for multimodal alignment. To address this, we propose the shrinkage transformer-visual question answering (ST-VQA) framework, which aims to achieve more accurate multimodal alignment than the standard transformer. First, ST-VQA uses the region features of an image as its visual representation. Second, between transformer layers, ST-VQA reduces the number of visual regions through feature fusion and enforces distinctiveness among the new regions with a contrastive loss. Finally, the visual and textual features are fused and used to predict the answer. Extensive experiments demonstrate that, without pretraining, our proposed method outperforms the standard transformer and several state-of-the-art methods on the VQA-v2 dataset.
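The shrinkage step described in the abstract can be pictured with a short sketch. The following PyTorch code is a minimal illustration under stated assumptions, not the authors' implementation: the fusion rule (concatenating adjacent region pairs through a linear layer) and the contrastive penalty (softplus over off-diagonal cosine similarities) are stand-ins for the paper's actual region-fusion and contrast-loss designs, and all names (RegionShrinkage, region_contrastive_loss, tau) are hypothetical.

# Minimal sketch of region shrinkage with a contrastive penalty.
# Assumption: regions are fused in adjacent pairs; the paper's
# actual fusion and loss may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionShrinkage(nn.Module):
    """Halve the number of visual regions by fusing adjacent pairs."""
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)  # merge two region vectors into one

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, n_regions, dim); n_regions assumed even here
        b, n, d = regions.shape
        paired = regions.reshape(b, n // 2, 2 * d)  # group adjacent pairs
        return self.fuse(paired)                    # (batch, n_regions // 2, dim)

def region_contrastive_loss(regions: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # Push the fused regions of each image apart by penalizing high
    # pairwise cosine similarity between distinct regions.
    z = F.normalize(regions, dim=-1)                # (b, n, d)
    sim = torch.matmul(z, z.transpose(1, 2)) / tau  # (b, n, n) similarities
    n = sim.size(1)
    mask = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    off_diag = sim.masked_select(mask.unsqueeze(0).expand_as(sim))
    return F.softplus(off_diag).mean()

if __name__ == "__main__":
    feats = torch.randn(2, 36, 512)  # e.g., 36 Faster R-CNN region features
    shrink = RegionShrinkage(512)
    fused = shrink(feats)            # -> (2, 18, 512)
    loss = region_contrastive_loss(fused)
    print(fused.shape, loss.item())

In this reading, each shrinkage stage trades spatial granularity for fewer, more distinctive region tokens, which is what makes finer-grained alignment with the text tokens tractable in later layers.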
Pages: 20967-20978
Number of pages: 11