TRANS-VQA: Fully Transformer-Based Image Question-Answering Model Using Question-guided Vision Attention

Cited by: 0
Authors
Koshti, Dipali [1 ]
Gupta, Ashutosh [1 ]
Kalla, Mukesh [1 ]
Sharma, Arvind [2 ]
Affiliations
[1] Sir Padampat Singhania Univ, Dept Comp Sci & Engn, Udaipur, Rajasthan, India
[2] Sir Padampat Singhania Univ, Dept Math, Udaipur, Rajasthan, India
Keywords
Question-guided VQA; Visual question answering; Transformer-based VQA; BERT-based VQA;
DOI
10.4114/intartf.vol27iss73pp111-128
CLC Classification: TP18 [Theory of Artificial Intelligence];
Subject Classification Codes: 081104; 0812; 0835; 1405;
Abstract
Understanding multiple modalities and relating them is an easy task for humans, but for machines it is a challenging one. One such multimodal reasoning task is visual question answering (VQA), which requires a machine to produce an answer to a natural-language query about a given image. Although plenty of work has been done in this field, improving the answer-prediction ability of models and surpassing human accuracy remain open challenges. A novel transformer-based model for answering image-based questions is proposed. The proposed model is a fully transformer-based architecture that utilizes the power of a transformer both for extracting language features and for performing joint understanding of question and image features. The proposed VQA model uses Faster R-CNN for image feature extraction. The retrieved language features and object-level image features are fed to a decoder inspired by the Bidirectional Encoder Representations from Transformers (BERT) architecture, which jointly learns image characteristics guided by the question characteristics, yielding rich representations of the image features. Extensive experimentation has been carried out to observe the effect of various hyperparameters on the performance of the model. The experimental results demonstrate that the model's ability to predict the answer increases with the number of layers in the transformer's encoder and decoder. The proposed model improves upon previous models and is highly scalable due to the introduction of BERT. Our best model reports 72.31% accuracy on the test-standard split of the VQAv2 dataset.
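The core mechanism the abstract describes — question features guiding attention over object-level image features from Faster R-CNN — can be illustrated with a minimal NumPy sketch of scaled dot-product cross-attention. This is not the authors' code; the feature dimension, token counts, and the 36-region setting are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_attention(q_tokens, img_regions, d_k):
    """q_tokens: (Lq, d) question features; img_regions: (Nr, d) region features.

    Each question token scores every image region, and a weighted sum of
    regions yields a question-guided image representation per token.
    """
    scores = q_tokens @ img_regions.T / np.sqrt(d_k)  # (Lq, Nr) relevance scores
    weights = softmax(scores, axis=-1)                # attention over regions
    return weights @ img_regions                      # (Lq, d) attended features

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal((5, d))    # 5 question-token features (hypothetical)
v = rng.standard_normal((36, d))   # 36 detected object regions (a common F-RCNN setting)
out = question_guided_attention(q, v, d)
print(out.shape)  # (5, 8)
```

In the full model, such cross-attention would sit inside each BERT-style decoder layer, with learned projections for queries, keys, and values and multiple heads; this sketch omits those to show only the question-guided weighting itself.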
Pages: 111-128 (18 pages)