Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration

Citations: 0

Authors:

Nguyen, Ngoc Son [1,3]

Nguyen, Van Son [1,3]

Le, Tung [2,3]

Affiliations:

[1] Univ Sci, Fac Math & Comp Sci, Ho Chi Minh, Vietnam

[2] Univ Sci, Fac Informat Technol, Ho Chi Minh, Vietnam

[3] Vietnam Natl Univ, Ho Chi Minh, Vietnam

Source:

COMPUTERS & ELECTRICAL ENGINEERING | 2024 / Vol. 119

Keywords:

Visual question answering; ViVQA; EfficientNet; BLIP-2; Convolutional

DOI:

10.1016/j.compeleceng.2024.109474

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Visual Question Answering (VQA) has recently emerged as a potential research domain, captivating the interest of many in the field of artificial intelligence and computer vision. Despite the prevalence of approaches in English, there is a notable lack of systems specifically developed for certain languages, particularly Vietnamese. This study aims to bridge this gap by conducting comprehensive experiments on the Vietnamese Visual Question Answering (ViVQA) dataset, demonstrating the effectiveness of our proposed model. In response to community interest, we have developed a model that enhances image representation capabilities, thereby improving overall performance in the ViVQA system. Therefore, we propose AViVQA-TranConI (Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration). AViVQA-TranConI integrates the Bootstrapping Language-Image Pre-training with frozen unimodal models (BLIP-2) and the convolutional neural network EfficientNet to extract and process both local and global features from images. This integration leverages the strengths of transformer-based architectures for capturing comprehensive contextual information and convolutional networks for detailed local features. By freezing the parameters of these pre-trained models, we significantly reduce the computational cost and training time, while maintaining high performance. This approach significantly improves image representation and enhances the performance of existing VQA systems. We then leverage a multi-modal fusion module based on a general-purpose multi-modal foundation model (BEiT-3) to fuse the information between visual and textual features. Our experimental findings demonstrate that AViVQA-TranConI surpasses competing baselines, achieving promising performance. This is particularly evident in its accuracy of 71.04% on the test set of the ViVQA dataset, marking a significant advancement in our research area.
The code is available at https://github.com/nngocson2002/ViVQA.
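
The abstract does not spell out how the global (transformer) and local (convolutional) image features are combined. The following is only a minimal sketch of one plausible scheme, assuming each feature set is linearly projected to a shared width and the two are concatenated along the token axis. The function names, shapes, and random stand-in features are all hypothetical and are not taken from the paper or its repository.

```python
import random


def linear(tokens, weight):
    """Apply a (d_in x d_out) projection matrix to every token vector."""
    return [
        [sum(x * w for x, w in zip(tok, col)) for col in zip(*weight)]
        for tok in tokens
    ]


def random_matrix(rows, cols, rng):
    """Small random matrix used as a stand-in for features or weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def fuse_image_features(global_tokens, local_tokens, w_global, w_local):
    """Project both token sets to a shared width, then concatenate them
    along the token axis (an assumed fusion scheme, not the paper's)."""
    return linear(global_tokens, w_global) + linear(local_tokens, w_local)


rng = random.Random(0)

# Stand-ins for the outputs of the frozen backbones (illustrative shapes only).
global_tokens = random_matrix(4, 8, rng)  # 4 tokens of width 8 (transformer side)
local_tokens = random_matrix(9, 6, rng)   # 3x3 feature map flattened to 9 tokens of width 6

# In this sketch only the projections would be trainable; the backbones stay frozen.
w_global = random_matrix(8, 5, rng)
w_local = random_matrix(6, 5, rng)

fused = fuse_image_features(global_tokens, local_tokens, w_global, w_local)
print(len(fused), len(fused[0]))  # 13 tokens, each of width 5
```

Under these assumptions, the fused sequence would then be passed, together with the question tokens, to the multi-modal fusion module.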

Pages: 18
