TRANS-VQA: Fully Transformer-Based Image Question-Answering Model Using Question-guided Vision Attention

Cited by: 0
Authors
Koshti, Dipali [1 ]
Gupta, Ashutosh [1 ]
Kalla, Mukesh [1 ]
Sharma, Arvind [2 ]
Affiliations
[1] Sir Padampat Singhania Univ, Dept Comp Sci & Engn, Udaipur, Rajasthan, India
[2] Sir Padampat Singhania Univ, Dept Math, Udaipur, Rajasthan, India
Keywords
Question-guided VQA; Visual question answering; Transformer-based VQA; BERT-based VQA;
DOI
10.4114/intartf.vol27iss73pp111-128
CLC Classification: TP18 [Theory of Artificial Intelligence];
Subject Classification Codes: 081104; 0812; 0835; 1405;
Abstract
Understanding multiple modalities and relating them is an easy task for humans, but for machines it is a challenging one. One such multimodal reasoning task is visual question answering (VQA), which requires a machine to produce an answer to a natural-language query about a given image. Although plenty of work has been done in this field, improving the answer-prediction ability of models and surpassing human accuracy remain open challenges. A novel transformer-based model for answering image-based questions is proposed. The proposed model is a fully transformer-based architecture that utilizes the power of a transformer both for extracting language features and for performing joint understanding of question and image features. The proposed VQA model uses Faster R-CNN for image feature extraction. The retrieved language features and object-level image features are fed to a decoder inspired by the Bidirectional Encoder Representations from Transformers (BERT) architecture, which jointly learns image characteristics guided by the question characteristics, yielding rich representations of the image features. Extensive experimentation has been carried out to observe the effect of various hyperparameters on the performance of the model. The experimental results demonstrate that the model's ability to predict the answer increases with the number of layers in the transformer's encoder and decoder. The proposed model improves upon previous models and is highly scalable due to the introduction of BERT. Our best model reports 72.31% accuracy on the test-standard split of the VQAv2 dataset.
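The core mechanism the abstract describes — question features guiding attention over object-level image features from Faster R-CNN — can be illustrated with a minimal NumPy sketch of scaled dot-product cross-attention. This is not the authors' code; the feature dimension, token counts, and the 36-region setting are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_attention(q_tokens, img_regions, d_k):
    """q_tokens: (Lq, d) question features; img_regions: (Nr, d) region features.

    Each question token scores every image region, and a weighted sum of
    regions yields a question-guided image representation per token.
    """
    scores = q_tokens @ img_regions.T / np.sqrt(d_k)  # (Lq, Nr) relevance scores
    weights = softmax(scores, axis=-1)                # attention over regions
    return weights @ img_regions                      # (Lq, d) attended features

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal((5, d))    # 5 question-token features (hypothetical)
v = rng.standard_normal((36, d))   # 36 detected object regions (a common F-RCNN setting)
out = question_guided_attention(q, v, d)
print(out.shape)  # (5, 8)
```

In the full model, such cross-attention would sit inside each BERT-style decoder layer, with learned projections for queries, keys, and values and multiple heads; this sketch omits those to show only the question-guided weighting itself.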
Pages: 111-128 (18 pages)