Transformer-based Sparse Encoder and Answer Decoder for Visual Question Answering

Cited by: 1

Authors:

Peng, Longkun [1,2]

An, Gaoyun [1,2]

Ruan, Qiuqi [1,2]

Affiliations:

[1] Beijing Jiaotong Univ, Inst Informat Sci, Beijing 100044, Peoples R China

[2] Beijing Key Lab Adv Informat Sci & Network Techno, Beijing 100044, Peoples R China

Source:

2022 16TH IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP2022), VOL 1 | 2022

Funding:

National Natural Science Foundation of China;

Keywords:

visual question answering; sparse; relevance scores; answer decoder;

DOI:

10.1109/ICSP56322.2022.9965298

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Visual Question Answering (VQA) requires jointly understanding images and questions. Existing Transformer-based methods achieve excellent performance by associating questions with image region objects and directly using a special classification token for answer prediction. However, answering a question requires attending to only a few specific keywords and image regions, and computing attention over every pairing of question words and image region objects introduces unnecessary noise. Meanwhile, the information from the two modalities cannot be fully utilized when the answer is predicted directly from the classification token. To this end, we propose a Transformer-based Sparse Encoder and Answer Decoder (SEAD) model for visual question answering, in which a two-stream sparse Transformer module based on co-attention is built to enhance the most relevant visual features and textual descriptions across modalities. Furthermore, a single-step answer decoder is proposed to fully exploit the information of both modalities in the answer prediction stage, and a strategy is designed that uses the ground truth to correct the visual relevance scores in the decoder so that it focuses on salient objects in the image. Experimental results on the VQA v2.0 benchmark dataset show that our model performs well.


Pages: 120-123

Page count: 4
