Multi visual and textual embedding on visual question answering for blind people

Cited by: 5
Authors
Tung Le [1 ]
Huy Tien Nguyen [2 ,3 ,4 ]
Minh Le Nguyen [1 ]
Affiliations
[1] Japan Adv Inst Sci & Technol JAIST, Nomi, Ishikawa, Japan
[2] Univ Sci, Fac Informat Technol, Ho Chi Minh, Vietnam
[3] Vietnam Natl Univ, Ho Chi Minh, Vietnam
[4] Vingrp Big Data Inst, Hanoi, Vietnam
Keywords
Visual question answering; Multi-visual embedding; BERT; Stacked attention; Pre-trained model;
DOI
10.1016/j.neucom.2021.08.117
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
The visually impaired community, especially blind people, has a pressing need for assistive technologies that can understand an image and answer questions about it. Sitting at the intersection of vision and language, Visual Question Answering (VQA) predicts an answer to a textual question about an image, making it well suited to helping blind people capture an image and have their questions answered automatically. Traditional approaches often rely on convolutional and recurrent networks, which require considerable effort to train and optimize. A key challenge in VQA is finding an effective way to extract and combine textual and visual features. To exploit prior knowledge from different domains, we propose BERT-RG, a careful integration of pre-trained models as feature extractors that relies on the interaction between residual and global features of the image and linguistic features of the question. Our architecture further integrates a stacked attention mechanism that exploits the relationship between textual and visual objects: partial regions of the image interact with particular keywords in the question to enrich the text-vision representation. We also propose a novel perspective that focuses on a specific question type in VQA, arguing that a specialized system is more practical to develop than pursuing unconstrained, unrealistic approaches. Experiments on VizWiz-VQA, a practical benchmark dataset, show that our proposed model outperforms existing models on the Yes/No question type. (c) 2021 Elsevier B.V. All rights reserved.
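The stacked attention mechanism the abstract describes can be sketched compactly: a question vector repeatedly attends over image-region features, and the attended summary refines the query at each hop. The sketch below, in PyTorch, follows the general stacked-attention scheme rather than the paper's exact BERT-RG configuration; the hidden size, hop count, feature dimensions, and the random stand-in tensors (in place of real BERT and CNN features) are illustrative assumptions only.

```python
# Minimal sketch of two-hop stacked attention for VQA-style fusion.
# All layer sizes and dimensions are assumptions for illustration; in the
# paper's setting, `regions` would come from a pre-trained CNN (residual and
# global image features) and `question` from a pre-trained BERT encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StackedAttention(nn.Module):
    """Question vector attends over image regions; each hop refines the query."""

    def __init__(self, dim: int, hidden: int = 512, hops: int = 2):
        super().__init__()
        self.hops = hops
        self.w_img = nn.ModuleList([nn.Linear(dim, hidden, bias=False) for _ in range(hops)])
        self.w_qry = nn.ModuleList([nn.Linear(dim, hidden) for _ in range(hops)])
        self.w_att = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(hops)])

    def forward(self, regions: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # regions:  (batch, num_regions, dim) -- grid of visual features
        # question: (batch, dim)              -- pooled question embedding
        query = question
        for i in range(self.hops):
            # Joint representation of every region with the current query.
            h = torch.tanh(self.w_img[i](regions) + self.w_qry[i](query).unsqueeze(1))
            # Attention weights over regions: (batch, num_regions, 1).
            att = F.softmax(self.w_att[i](h), dim=1)
            # Weighted summary of regions, then residual refinement of the query.
            attended = (att * regions).sum(dim=1)
            query = attended + query
        return query  # fused text-vision representation, (batch, dim)


if __name__ == "__main__":
    regions = torch.randn(4, 49, 768)   # e.g. a 7x7 grid of region features (assumed dims)
    question = torch.randn(4, 768)      # e.g. a BERT [CLS] vector (assumed dims)
    fused = StackedAttention(dim=768)(regions, question)
    print(fused.shape)                  # torch.Size([4, 768])
```

The fused vector would then feed an answer classifier; the residual update `attended + query` is what lets a later hop sharpen the regions selected by an earlier one.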
Pages: 451-464
Page count: 14
Related papers
50 items in total
  • [1] Visual Question Answering with Textual Representations for Images
    Hirota, Yusuke
    Garcia, Noa
    Otani, Mayu
    Chu, Chenhui
    Nakashima, Yuta
    Taniguchi, Ittetsu
    Onoye, Takao
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3147 - 3150
  • [2] Visual-Textual Semantic Alignment Network for Visual Question Answering
    Tian, Weidong
    Zhang, Yuzheng
    He, Bin
    Zhu, Junjun
    Zhao, Zhongqiu
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2021, PT V, 2021, 12895 : 259 - 270
  • [3] Dynamic Memory Networks for Visual and Textual Question Answering
    Xiong, Caiming
    Merity, Stephen
    Socher, Richard
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 48, 2016, 48
  • [4] An Enhanced Term Weighted Question Embedding for Visual Question Answering
    Manmadhan, Sruthy
    Kovoor, Binsu C.
    [J]. JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT, 2022, 21 (02)
  • [5] Multi-stage hybrid embedding fusion network for visual question answering
    Lao, Mingrui
    Guo, Yanming
    Pu, Nan
    Chen, Wei
    Liu, Yu
    Lew, Michael S.
    [J]. NEUROCOMPUTING, 2021, 423 : 541 - 550
  • [6] Multi-Question Learning for Visual Question Answering
    Lei, Chenyi
    Wu, Lei
    Liu, Dong
    Li, Zhao
    Wang, Guoxin
    Tang, Haihong
    Li, Houqiang
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11328 - 11335
  • [7] Context-aware Multi-level Question Embedding Fusion for visual question answering
    Li, Shengdong
    Gong, Chen
    Zhu, Yuqing
    Luo, Chuanwen
    Hong, Yi
    Lv, Xueqiang
    [J]. INFORMATION FUSION, 2024, 102
  • [8] Embedding Spatial Relations in Visual Question Answering for Remote Sensing
    Faure, Maxime
    Lobry, Sylvain
    Kurtz, Camille
    Wendling, Laurent
    [J]. 2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 310 - 316
  • [9] Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents
    Wang, Bo
    Xu, Youjiang
    Han, Yahong
    Hong, Richang
    [J]. THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 7380 - 7387
  • [10] Parallel multi-head attention and term-weighted question embedding for medical visual question answering
    Manmadhan, Sruthy
    Kovoor, Binsu C.
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (22) : 34937 - 34958