Transformer-based Sparse Encoder and Answer Decoder for Visual Question Answering

Cited by: 1
Authors
Peng, Longkun [1, 2]
An, Gaoyun [1, 2]
Ruan, Qiuqi [1, 2]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
visual question answering; sparse; relevance scores; answer decoder;
DOI
10.1109/ICSP56322.2022.9965298
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Visual Question Answering (VQA) requires jointly understanding an image and a question about it. Existing Transformer-based methods achieve excellent performance by associating questions with image region objects and predicting the answer directly from a special classification token. However, answering a question requires attention to only a few specific keywords and image regions, and computing attention over all question words and image region objects introduces unnecessary noise. Moreover, the information from the two modalities cannot be fully utilized when the answer is predicted directly from the classification token. To address these issues, we propose a Transformer-based Sparse Encoder and Answer Decoder (SEAD) model for visual question answering. A two-stream sparse Transformer module based on co-attention is built to enhance the most relevant visual features and textual descriptions across modalities. Furthermore, a single-step answer decoder is proposed to fully exploit the information of both modalities in the answer prediction stage, and a strategy is designed that uses the ground truth to correct the visual relevance scores in the decoder so that it focuses on salient objects in the image. Experimental results on the VQA v2.0 benchmark dataset demonstrate the effectiveness of our model.
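The abstract does not say how the sparse co-attention selects what to keep. One common way to make attention sparse is to keep only the top-k relevance scores per query and mask out the rest before the softmax. The sketch below illustrates that idea only; it is an assumption, not the SEAD implementation. The function name `topk_sparse_attention`, the parameter `k`, and the toy vectors are hypothetical.

```python
import math


def topk_sparse_attention(queries, keys, values, k):
    """Cross-attention in which each query keeps only its k highest-scoring keys.

    Illustrative assumption: top-k masking is one generic sparsification
    scheme; the abstract does not confirm that SEAD uses it.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot-product relevance score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(dim)
                  for key in keys]
        # Indices of the k largest scores; every other key is masked out.
        keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        # Softmax over the kept scores only (subtract the max for stability).
        top = max(scores[i] for i in keep)
        exp = {i: math.exp(scores[i] - top) for i in keep}
        total = sum(exp.values())
        weights = [exp.get(i, 0.0) / total for i in range(len(scores))]
        # Weighted sum of value vectors; masked keys contribute nothing.
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy usage: one question-word query attending over three image-region keys.
question = [[1.0, 0.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
region_values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
attended = topk_sparse_attention(question, regions, region_values, k=1)
```

With `k=1` the query attends only to the single best-matching region, so the output is that region's value vector. Setting `k` to the number of keys recovers ordinary dense attention.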
Pages: 120-123
Page count: 4