Multimodal Attention for Visual Question Answering

Cited by: 0

Authors:

Kodra, Lorena [1]

Mece, Elinda Kajo [1]

Affiliations:

[1] Polytech Univ Tirana, Tirana, Albania

Source:

INTELLIGENT COMPUTING, VOL 1 | 2019 / Vol. 858

Keywords:

Visual Question Answering (VQA); Multimodal attention mechanism; Convolutional Neural Networks (CNN); Recurrent Neural Networks (RNN); Long Short-Term Memory (LSTM);

DOI:

10.1007/978-3-030-01174-1_60

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual Question Answering (VQA) is a task in which, given an image and a natural language question about the image, the aim is to provide an accurate natural language answer. In recent years, much work has been done in this area to address the challenges this task presents and to improve model accuracy. One recently introduced concept is the attention mechanism, in which the model focuses on specific parts of the input in order to generate the answer. In this paper, we present a novel LSTM architecture for VQA that uses multimodal attention to focus on specific parts of the image and on specific words of the question in order to generate a more precise answer. We evaluate our proposed solution on the VQA dataset and show that it performs better than state-of-the-art models.
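The abstract's idea of attending to both image regions and question words can be illustrated with a minimal sketch. This is not the authors' LSTM architecture, which this record does not describe in detail. It assumes plain dot-product soft attention, mean-pooled guidance vectors, and concatenation as the fusion step; the names `attend`, `multimodal_attention`, `region_feats`, and `word_feats` are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def attend(query, items):
    """Soft attention: weight each item by softmax of its dot product with the query."""
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    context = [sum(w * item[d] for w, item in zip(weights, items))
               for d in range(dim)]
    return weights, context

def multimodal_attention(region_feats, word_feats):
    """Attend over image regions guided by the question, and over
    question words guided by the image (assumed scheme, not the paper's)."""
    question_summary = mean(word_feats)    # assumption: mean-pooled question
    image_summary = mean(region_feats)     # assumption: mean-pooled image
    region_weights, image_ctx = attend(question_summary, region_feats)
    word_weights, question_ctx = attend(image_summary, word_feats)
    fused = image_ctx + question_ctx       # assumption: fuse by concatenation
    return region_weights, word_weights, fused

# Toy features: 3 image regions and 2 question words, each of dimension 2.
region_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
word_feats = [[2.0, 0.0], [0.0, 0.1]]
region_weights, word_weights, fused = multimodal_attention(region_feats, word_feats)
```

In a real model the region features would typically come from a CNN and the word features from an RNN or LSTM, as the keywords suggest, and the fused vector would feed an answer classifier; those components are left out of this sketch.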


Pages: 783 - 792

Number of pages: 10

50 items in total

[1] QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document
Mahamoud, Ibrahim Souleiman
Coustaty, Mickael
Joseph, Aurelie
d'Andecy, Vincent Poulain
Ogier, Jean-Marc
[J]. DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 659 - 673
[2] Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering
Li, Haiyan
Han, Dezhi
[J]. COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2021, 18 (03) : 1023 - 1040
[3] Multimodal attention-driven visual question answering for Malayalam
Abhishek Gopinath Kovath
Anand Nayyar
O. K. Sikha
[J]. Neural Computing and Applications, 2024, 36 (24) : 14691 - 14708
[4] Multimodal Encoder-Decoder Attention Networks for Visual Question Answering
Chen, Chongqing
Han, Dezhi
Wang, Jun
[J]. IEEE ACCESS, 2020, 8 : 35662 - 35671
[5] Multimodal feature fusion by relational reasoning and attention for visual question answering
Zhang, Weifeng
Yu, Jing
Hu, Hua
Hu, Haiyang
Qin, Zengchang
[J]. INFORMATION FUSION, 2020, 55 : 116 - 126
[6] Multimodal Cross-guided Attention Networks for Visual Question Answering
Liu, Haibin
Gong, Shengrong
Ji, Yi
Yang, Jianyu
Xing, Tengfei
Liu, Chunping
[J]. PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON COMPUTER MODELING, SIMULATION AND ALGORITHM (CMSA 2018), 2018, 151 : 347 - 353
[7] Multimodal Bi-direction Guided Attention Networks for Visual Question Answering
Cai, Linqin
Xu, Nuoying
Tian, Hang
Chen, Kejia
Fan, Haodu
[J]. NEURAL PROCESSING LETTERS, 2023, 55 (09) : 11921 - 11943
[8] Multimodal Bi-direction Guided Attention Networks for Visual Question Answering
Linqin Cai
Nuoying Xu
Hang Tian
Kejia Chen
Haodu Fan
[J]. Neural Processing Letters, 2023, 55 : 11921 - 11943
[9] An Improved Attention for Visual Question Answering
Rahman, Tanzila
Chou, Shih-Han
Sigal, Leonid
Carenini, Giuseppe
[J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021 : 1653 - 1662
[10] Differential Attention for Visual Question Answering
Patro, Badri
Namboodiri, Vinay P.
[J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018 : 7680 - 7688
