Transformer-based Sparse Encoder and Answer Decoder for Visual Question Answering

Cited by: 1
Authors
Peng, Longkun [1 ,2 ]
An, Gaoyun [1 ,2 ]
Ruan, Qiuqi [1 ,2 ]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
visual question answering; sparse; relevance scores; answer decoder
DOI
10.1109/ICSP56322.2022.9965298
CLC Number
TP31 [Computer Software]
Discipline Codes
081202; 0835
Abstract
Visual Question Answering (VQA) requires jointly understanding images and questions. Existing Transformer-based methods achieve excellent performance by associating questions with image region objects and directly using a special classification token for answer prediction. However, answering a question requires attending only to a few specific keywords and image regions, so exhaustively computing attention between all question words and image region objects introduces unnecessary noise. Moreover, directly using the classification token to predict the answer does not fully exploit the information from the two modalities. To this end, we propose a Transformer-based Sparse Encoder and Answer Decoder (SEAD) model for visual question answering, in which a two-stream sparse Transformer module based on co-attention enhances the most relevant visual features and textual descriptions across modalities. Furthermore, a single-step answer decoder is proposed to fully exploit the information of both modalities during answer prediction, together with a strategy that uses the ground truth to correct the visual relevance scores in the decoder so that it focuses on salient objects in the image. Experimental results on the VQA v2.0 benchmark dataset show that our model performs strongly.
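The abstract does not give implementation details, but the core idea of a sparse cross-modal attention module can be illustrated with a minimal PyTorch sketch. This is not the authors' code: the module name, the top-k sparsification scheme, and the parameter k are all assumptions chosen to show one plausible way of letting each query attend only to its most relevant keys from the other modality.

```python
# A minimal sketch (not the authors' implementation) of sparse co-attention:
# each query keeps only its top-k cross-modal attention scores, and the rest
# are masked to -inf so softmax assigns them zero weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseCoAttention(nn.Module):
    """Cross-modal attention where each query attends to only its top-k keys."""
    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.k = k                      # number of keys kept per query (assumed)
        self.scale = dim ** -0.5

    def forward(self, queries, keys_values):
        # queries:      (B, Nq, D), e.g. question token features
        # keys_values:  (B, Nk, D), e.g. image region features
        q = self.q_proj(queries)
        k = self.k_proj(keys_values)
        v = self.v_proj(keys_values)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale  # (B, Nq, Nk)
        # Sparsify: find the k-th largest score per query and mask everything
        # below it, so only the most relevant cross-modal pairs survive.
        topk = scores.topk(min(self.k, scores.size(-1)), dim=-1).values
        threshold = topk[..., -1:]                                  # k-th largest
        scores = scores.masked_fill(scores < threshold, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return torch.matmul(attn, v)                                # (B, Nq, D)

# Usage: attend 14 question tokens to the 8 most relevant of 36 image regions.
regions = torch.randn(2, 36, 512)
words = torch.randn(2, 14, 512)
attended = SparseCoAttention(dim=512, k=8)(words, regions)
print(attended.shape)                  # torch.Size([2, 14, 512])
```

A symmetric second stream (image regions attending to question words) would give the two-stream structure the abstract describes; the ground-truth-guided correction of visual relevance scores in the decoder is specific to the paper and not sketched here.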
Pages: 120 - 123
Number of pages: 4