Multi visual and textual embedding on visual question answering for blind people

Cited by: 5
Authors
Tung Le [1 ]
Huy Tien Nguyen [2 ,3 ,4 ]
Minh Le Nguyen [1 ]
Institutions
[1] Japan Adv Inst Sci & Technol JAIST, Nomi, Ishikawa, Japan
[2] Univ Sci, Fac Informat Technol, Ho Chi Minh, Vietnam
[3] Vietnam Natl Univ, Ho Chi Minh, Vietnam
[4] Vingrp Big Data Inst, Hanoi, Vietnam
Keywords
Visual question answering; Multi-visual embedding; BERT; Stacked attention; Pre-trained model;
DOI
10.1016/j.neucom.2021.08.117
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The visually impaired community, and blind people in particular, stand to benefit greatly from advanced technologies that help them understand images and answer questions about them. Built at the intersection of vision and language, Visual Question Answering (VQA) predicts an answer to a textual question about an image, making it an ideal aid for blind people who capture an image and ask questions about it. Traditional approaches often rely on convolutional and recurrent networks, which require considerable effort to train and optimize. A key challenge in VQA is finding an effective way to extract and combine textual and visual features. To take advantage of prior knowledge from different domains, we propose BERT-RG, a careful integration of pre-trained models into feature extractors that relies on the interaction between residual and global features of the image and linguistic features of the question. Moreover, our architecture integrates a stacked attention mechanism that exploits the relationship between textual and visual objects: partial regions of the image interact with individual keywords in the question to enhance the joint text-vision representation. In addition, we propose a novel perspective that focuses on a specific question type in VQA; this focus is meaningful enough to justify a specialized system rather than pursuing unlimited and unrealistic general approaches. Experiments on VizWiz-VQA, a practical benchmark dataset, show that our proposed model outperforms existing models on the VizWiz VQA dataset for the Yes/No question type. (c) 2021 Elsevier B.V. All rights reserved.
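The stacked attention idea the abstract refers to can be illustrated with a minimal sketch: a question vector attends over image-region features, the attended visual vector refines the question vector, and the process repeats for several hops. This is a generic NumPy sketch of stacked attention in the spirit of Yang et al. (2016), not the authors' BERT-RG implementation; all shapes, weight names, and the two-hop setup are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def stacked_attention(regions, question, weights):
    """Refine a question vector over several attention hops.

    regions  : (m, d) array of image-region features
    question : (d,)   question embedding
    weights  : list of (W_v, W_q, w_h) per hop (illustrative parameters)
    """
    u = question
    p = None
    for W_v, W_q, w_h in weights:
        h = np.tanh(regions @ W_v + u @ W_q)  # (m, k) joint hidden features
        p = softmax(h @ w_h)                  # (m,) attention over regions
        v_att = p @ regions                   # (d,) attended visual vector
        u = u + v_att                         # refine the query for the next hop
    return u, p

# toy example: 49 regions (7x7 grid), feature dim 8, hidden dim 16, two hops
rng = np.random.default_rng(0)
m, d, k = 49, 8, 16
regions = rng.normal(size=(m, d))
question = rng.normal(size=d)
weights = [(rng.normal(size=(d, k)), rng.normal(size=(d, k)), rng.normal(size=k))
           for _ in range(2)]
u, p = stacked_attention(regions, question, weights)
```

Each hop sharpens where the model "looks": the attention distribution `p` from the last hop indicates which image regions most influence the final representation `u`, which would then feed an answer classifier.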
Pages: 451 - 464
Number of pages: 14
Related papers
(50 total)
  • [21] LANGUAGE AND VISUAL RELATIONS ENCODING FOR VISUAL QUESTION ANSWERING
    Liu, Fei
    Liu, Jing
    Fang, Zhiwei
    Lu, Hanqing
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2019, : 3307 - 3311
  • [22] Multi-stage Attention based Visual Question Answering
    Mishra, Aakansha
    Anand, Ashish
    Guha, Prithwijit
    [J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 9407 - 9414
  • [23] Visual Question Answering using Explicit Visual Attention
    Lioutas, Vasileios
    Passalis, Nikolaos
    Tefas, Anastasios
    [J]. 2018 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), 2018
  • [24] Multi-level Attention Networks for Visual Question Answering
    Yu, Dongfei
    Fu, Jianlong
    Mei, Tao
    Rui, Yong
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 4187 - 4195
  • [25] QUES-TO-VISUAL GUIDED VISUAL QUESTION ANSWERING
    Wu, Xiangyu
    Lu, Jianfeng
    Li, Zhuanfeng
    Xiong, Fengchao
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 4193 - 4197
  • [26] Exploiting hierarchical visual features for visual question answering
    Hong, Jongkwang
    Fu, Jianlong
    Uh, Youngjung
    Mei, Tao
    Byun, Hyeran
    [J]. NEUROCOMPUTING, 2019, 351 : 187 - 195
  • [27] PRIOR VISUAL RELATIONSHIP REASONING FOR VISUAL QUESTION ANSWERING
    Yang, Zhuoqian
    Qin, Zengchang
    Yu, Jing
    Wan, Tao
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 1411 - 1415
  • [28] Fusing Multi-graph Structures for Visual Question Answering
    Hu, Yuncong
    Zhang, Ru
    Liu, Jianyi
    Yan, Dong
    [J]. ASIA-PACIFIC JOURNAL OF CLINICAL ONCOLOGY, 2023, 19 : 13 - 13
  • [29] Robust Explanations for Visual Question Answering
    Patro, Badri N.
    Patel, Shivansh
    Namboodiri, Vinay P.
    [J]. 2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 1566 - 1575
  • [30] An Improved Attention for Visual Question Answering
    Rahman, Tanzila
    Chou, Shih-Han
    Sigal, Leonid
    Carenini, Giuseppe
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 1653 - 1662