Transformer-based Sparse Encoder and Answer Decoder for Visual Question Answering

Cited by: 1
Authors
Peng, Longkun [1 ,2 ]
An, Gaoyun [1 ,2 ]
Ruan, Qiuqi [1 ,2 ]
Affiliations
[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China
[2] Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
visual question answering; sparse; relevance scores; answer decoder;
DOI
10.1109/ICSP56322.2022.9965298
CLC classification number
TP31 [Computer Software];
Subject classification codes
081202 ; 0835 ;
Abstract
Visual Question Answering (VQA) requires understanding both images and questions. Existing Transformer-based methods achieve excellent performance by associating questions with image region objects and directly using a special classification token for answer prediction. However, answering a question requires focusing only on a few specific keywords and image regions; computing attention over all question words and image region objects introduces unnecessary noise. Moreover, directly using the classification token to predict the answer fails to fully exploit the information from the two modalities. To this end, we propose a Transformer-based Sparse Encoder and Answer Decoder (SEAD) model for visual question answering, in which a two-stream sparse Transformer module based on co-attention enhances the most relevant visual features and textual descriptions across modalities. Furthermore, a single-step answer decoder is proposed to fully exploit the information of both modalities in the answer prediction stage, and a strategy is designed that uses the ground truth to correct the visual relevance scores in the decoder so that it focuses on salient objects in the image. Experimental results on the VQA v2.0 benchmark dataset demonstrate the effectiveness of our model.
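The sparse cross-modal attention idea described in the abstract, keeping only the most relevant scores between question words and image regions, can be sketched as top-k masked attention. The function name, tensor shapes, and the top-k selection rule below are illustrative assumptions for one plausible realization, not the paper's exact formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(q, k, v, topk):
    """Cross-modal attention that keeps only the top-k scores per query.

    q: (n_q, d) queries from one modality (e.g. question tokens)
    k, v: (n_kv, d) keys/values from the other (e.g. image regions)
    topk: number of key/value entries each query may attend to.
    Hypothetical sketch: the top-k masking stands in for the paper's
    sparse relevance-score mechanism.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # (n_q, n_kv) relevance scores
    # Mask everything below each row's k-th largest score to -inf before
    # softmax, so low-relevance entries get exactly zero weight (the
    # "sparse" step that suppresses noisy cross-modal interactions).
    kth = np.sort(scores, axis=-1)[:, -topk][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    return softmax(masked, axis=-1) @ v        # (n_q, d) attended features
```

With `topk` equal to the number of regions this reduces to ordinary dense attention; smaller values force each question token to pool information from only its most relevant regions.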
Pages: 120 - 123
Page count: 4