Multi visual and textual embedding on visual question answering for blind people

Cited by: 5
Authors
Tung Le [1]
Huy Tien Nguyen [2,3,4]
Minh Le Nguyen [1]
Affiliations
[1] Japan Adv Inst Sci & Technol JAIST, Nomi, Ishikawa, Japan
[2] Univ Sci, Fac Informat Technol, Ho Chi Minh, Vietnam
[3] Vietnam Natl Univ, Ho Chi Minh, Vietnam
[4] Vingrp Big Data Inst, Hanoi, Vietnam
Keywords
Visual question answering; Multi-visual embedding; BERT; Stacked attention; Pre-trained model;
DOI
10.1016/j.neucom.2021.08.117
CLC classification
TP18 [Theory of artificial intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
The visually impaired community, and blind people in particular, have a pressing need for assistive technologies that can understand images and answer questions about them. Emerging at the intersection of vision and language, Visual Question Answering (VQA) predicts an answer to a textual question posed about an image, making it well suited to capturing an image and answering blind users' questions automatically. Traditional approaches rely on convolutional and recurrent networks, which require considerable effort to train and optimize. A key challenge in VQA is extracting and combining textual and visual features effectively. To leverage prior knowledge from different domains, we propose BERT-RG, a careful integration of pre-trained models as feature extractors, which exploits the interaction between residual and global features of the image and linguistic features of the question. Our architecture also incorporates a stacked attention mechanism that models the relationship between textual and visual objects: partial regions of the image interact with partial keywords of the question to enrich the text-vision representation. In addition, we offer a novel perspective by focusing on a specific question type in VQA, which is meaningful enough to justify a specialized system rather than pursuing unlimited and unrealistic general-purpose approaches. Experiments on VizWiz-VQA, a practical benchmark dataset, show that our proposed model outperforms existing models on the Yes/No question type. (c) 2021 Elsevier B.V. All rights reserved.
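The stacked attention mechanism mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' BERT-RG implementation; it follows the generic stacked-attention pattern (attend over image regions conditioned on a question vector, then refine the query and attend again), with all weight shapes and names hypothetical:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def stacked_attention(V, q, layers):
    """Sketch of multi-hop stacked attention.
    V: (k, d) image region features, q: (d,) question embedding.
    layers: list of (Wv, Wq, wh) weight tuples, one per attention hop."""
    u = q
    for Wv, Wq, wh in layers:
        h = np.tanh(V @ Wv + u @ Wq)   # (k, m) joint text-vision representation
        p = softmax(h @ wh)            # (k,) attention weights over regions
        v_att = p @ V                  # (d,) attention-weighted visual summary
        u = v_att + u                  # refined query for the next hop
    return u, p

# Toy run with random features: 5 regions, feature dim 8, hidden dim 6, 2 hops.
rng = np.random.default_rng(0)
k, d, m = 5, 8, 6
V = rng.normal(size=(k, d))
q = rng.normal(size=(d,))
layers = [(rng.normal(size=(d, m)), rng.normal(size=(d, m)), rng.normal(size=(m,)))
          for _ in range(2)]
u, p = stacked_attention(V, q, layers)
```

Each hop lets the query focus on different image regions conditioned on what the previous hop retrieved, which is the "partial regions interact with partial keywords" behavior the abstract describes.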
Pages: 451-464
Page count: 14