Multi visual and textual embedding on visual question answering for blind people

Cited: 5
Authors
Tung Le [1 ]
Huy Tien Nguyen [2 ,3 ,4 ]
Minh Le Nguyen [1 ]
Affiliations
[1] Japan Advanced Institute of Science and Technology (JAIST), Nomi, Ishikawa, Japan
[2] University of Science, Faculty of Information Technology, Ho Chi Minh City, Vietnam
[3] Vietnam National University, Ho Chi Minh City, Vietnam
[4] Vingroup Big Data Institute, Hanoi, Vietnam
Keywords
Visual question answering; Multi-visual embedding; BERT; Stacked attention; Pre-trained model;
DOI
10.1016/j.neucom.2021.08.117
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The visually impaired community, especially blind people, has a pressing need for technologies that help them understand images and answer questions about them. At the intersection of vision and language, Visual Question Answering (VQA) predicts an answer to a textual question about an image, making it well suited to automatically capturing an image and answering a blind user's question. Traditional approaches rely on the strengths of convolutional and recurrent networks, which require considerable effort to train and optimize. A key challenge in VQA is finding an effective way to extract and combine textual and visual features. To take advantage of prior knowledge from different domains, we propose BERT-RG, a careful integration of pre-trained models as feature extractors that relies on the interaction between residual and global features of the image and linguistic features of the question. Moreover, our architecture integrates a stacked attention mechanism that exploits the relationship between textual and visual objects: partial regions of the image interact with keywords in the question to enrich the text-vision representation. In addition, we propose a novel perspective that considers a specific question type in VQA, arguing that developing a specialized system is more practical than pursuing unconstrained, unrealistic general approaches. Experiments on VizWiz-VQA, a practical benchmark dataset, show that our proposed model outperforms existing models on the VizWiz VQA dataset for the Yes/No question type. (c) 2021 Elsevier B.V. All rights reserved.
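The abstract's stacked attention mechanism, where question features repeatedly attend over image-region features to refine a joint representation, can be illustrated with a minimal sketch. This is not the authors' BERT-RG implementation; the function names, hop count, and additive query update are illustrative assumptions, and random vectors stand in for BERT question embeddings and residual/global visual features.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def stacked_attention(regions, question, num_hops=2):
    """Hypothetical stacked attention: the question vector attends over
    image-region features for several hops, refining its query each time."""
    query = question
    for _ in range(num_hops):
        scores = regions @ query      # (num_regions,) relevance of each region
        weights = softmax(scores)     # attention distribution over regions
        context = weights @ regions   # (d,) attention-weighted region summary
        query = query + context       # refine the query for the next hop
    return query

# toy example: 4 image regions with 8-dim features, one 8-dim question vector
rng = np.random.default_rng(42)
regions = rng.normal(size=(4, 8))
question = rng.normal(size=8)
fused = stacked_attention(regions, question)
print(fused.shape)
```

The repeated hops let later attention passes focus on regions relevant to what the first pass already gathered, which is the intuition behind stacking attention layers rather than applying a single pass.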
Pages: 451-464
Page count: 14