Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration

Times Cited: 0
Authors
Nguyen, Ngoc Son [1,3]
Nguyen, Van Son [1,3]
Le, Tung [2,3]
Affiliations
[1] Univ Sci, Fac Math & Comp Sci, Ho Chi Minh, Vietnam
[2] Univ Sci, Fac Informat Technol, Ho Chi Minh, Vietnam
[3] Vietnam Natl Univ, Ho Chi Minh, Vietnam
Keywords
Visual question answering; ViVQA; EfficientNet; BLIP-2; Convolutional;
DOI
10.1016/j.compeleceng.2024.109474
CLC Classification Number
TP3 [Computing Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
Visual Question Answering (VQA) has recently emerged as a promising research domain, captivating the interest of many in the fields of artificial intelligence and computer vision. Despite the prevalence of approaches in English, there is a notable lack of systems developed for other languages, particularly Vietnamese. This study aims to bridge this gap by conducting comprehensive experiments on the Vietnamese Visual Question Answering (ViVQA) dataset, demonstrating the effectiveness of our proposed model. In response to community interest, we have developed a model that enhances image representation capabilities, thereby improving overall performance in the ViVQA system. We therefore propose AViVQA-TranConI (Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration). AViVQA-TranConI integrates Bootstrapping Language-Image Pre-training with frozen unimodal models (BLIP-2) and the convolutional neural network EfficientNet to extract and process both local and global features from images. This integration leverages the strengths of transformer-based architectures for capturing comprehensive contextual information and of convolutional networks for capturing detailed local features. By freezing the parameters of these pre-trained models, we significantly reduce computational cost and training time while maintaining high performance. This approach substantially improves image representation and enhances the performance of existing VQA systems. We then leverage a multi-modal fusion module based on a general-purpose multi-modal foundation model (BEiT-3) to fuse the information between visual and textual features. Our experimental findings demonstrate that AViVQA-TranConI surpasses competing baselines, achieving promising performance, most notably an accuracy of 71.04% on the test set of the ViVQA dataset, marking a significant advancement in our research area. The code is available at https://github.com/nngocson2002/ViVQA.
Pages: 18
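To make the pipeline described in the abstract more concrete, below is a minimal PyTorch sketch of a dual-branch image encoder in the same spirit: a frozen EfficientNet branch supplying local features, combined with a transformer encoder supplying global context. This is an illustration under assumptions, not the authors' implementation: the EfficientNet-B0 variant, the small transformer used as a stand-in for BLIP-2's frozen vision tower, the residual-sum fusion (the paper instead uses a BEiT-3-based multimodal fusion module), and all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0, EfficientNet_B0_Weights


class DualImageEncoder(nn.Module):
    """Hypothetical sketch of a dual-branch image encoder: a frozen CNN
    for local features plus a transformer encoder for global context.
    The transformer branch is a stand-in for the frozen BLIP-2 vision
    tower described in the abstract."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        # Frozen convolutional branch (local features).
        cnn = efficientnet_b0(weights=EfficientNet_B0_Weights.DEFAULT)
        self.cnn = cnn.features            # keep the feature extractor, drop the classifier
        self.cnn.eval()                    # keep BatchNorm statistics fixed
        for p in self.cnn.parameters():
            p.requires_grad = False       # freezing is what cuts training cost
        self.local_proj = nn.Linear(1280, d_model)  # EfficientNet-B0 emits 1280 channels

        # Stand-in global branch: a small transformer over the CNN patch grid.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.global_enc = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, 224, 224)
        fmap = self.cnn(images)                   # (B, 1280, 7, 7)
        tokens = fmap.flatten(2).transpose(1, 2)  # (B, 49, 1280) patch tokens
        local = self.local_proj(tokens)           # (B, 49, d_model)
        global_ctx = self.global_enc(local)       # (B, 49, d_model)
        # Fuse local and global features; a simple residual sum here,
        # whereas the paper uses a BEiT-3-style multimodal fusion module.
        return local + global_ctx


if __name__ == "__main__":
    enc = DualImageEncoder()
    feats = enc(torch.randn(2, 3, 224, 224))
    print(feats.shape)  # torch.Size([2, 49, 768])
```

In this sketch only the projection and transformer layers receive gradients; the frozen CNN runs purely as a feature extractor, which mirrors the abstract's point that freezing the pre-trained backbones reduces computational cost and training time.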