Multimodal Attention for Visual Question Answering

Cited by: 0
Authors
Kodra, Lorena [1 ]
Mece, Elinda Kajo [1 ]
Affiliations
[1] Polytech Univ Tirana, Tirana, Albania
Keywords
Visual Question Answering (VQA); Multimodal attention mechanism; Convolutional Neural Networks (CNN); Recurrent Neural Networks (RNN); Long Short-Term Memory (LSTM)
DOI
10.1007/978-3-030-01174-1_60
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Visual Question Answering (VQA) is the task of providing an accurate natural-language answer to a natural-language question about a given image. In recent years, much work has been done in this area to address the challenges the task presents and to improve model accuracy. One recently introduced concept is the attention mechanism, whereby the model focuses on specific parts of the input in order to generate the answer. In this paper, we present a novel LSTM architecture for VQA that uses multimodal attention to focus on specific parts of the image as well as on specific words of the question in order to generate a more precise answer. We evaluate our proposed solution on the VQA dataset and show that it performs better than state-of-the-art models.
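The abstract describes attending jointly over image regions and question words, conditioned on the decoder state. The following is a minimal NumPy sketch of that idea as generic soft attention; the function name, projection matrices `Wv`/`Wq`, and all shapes are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_attention(h, V, Q, Wv, Wq):
    """Soft attention over image-region features V (k x d) and question-word
    features Q (n x d), conditioned on an LSTM hidden state h (d,).
    Wv and Wq are learned (d x d) projections; here they are placeholders."""
    alpha_v = softmax(V @ Wv @ h)   # (k,) weights over image regions
    alpha_q = softmax(Q @ Wq @ h)   # (n,) weights over question words
    v_att = alpha_v @ V             # (d,) attended visual context
    q_att = alpha_q @ Q             # (d,) attended textual context
    return np.concatenate([v_att, q_att])  # (2d,) fused multimodal context

# Toy example with random features in place of CNN / word embeddings.
rng = np.random.default_rng(0)
d, k, n = 8, 5, 4
ctx = multimodal_attention(rng.standard_normal(d),
                           rng.standard_normal((k, d)),
                           rng.standard_normal((n, d)),
                           rng.standard_normal((d, d)),
                           rng.standard_normal((d, d)))
print(ctx.shape)  # prints (16,)
```

In a full model the fused context vector would feed the answer classifier; here it simply demonstrates how the two modalities are weighted and combined.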
Pages: 783-792
Page count: 10
Related papers
50 records in total
  • [41] MedFuseNet: An attention-based multimodal deep learning model for visual question answering in the medical domain
    Sharma, Dhruv
    Purushotham, Sanjay
    Reddy, Chandan K.
    [J]. SCIENTIFIC REPORTS, 2021, 11 (01)
  • [42] Integrating multimodal features by a two-way co-attention mechanism for visual question answering
    Sharma, Himanshu
    Srivastava, Swati
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (21) : 59577 - 59595
  • [44] Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering
    Li, Yong
    Yang, Qihao
    Wang, Fu Lee
    Lee, Lap-Kei
    Qu, Yingying
    Hao, Tianyong
    [J]. ARTIFICIAL INTELLIGENCE IN MEDICINE, 2023, 144
  • [45] Multimodal fusion: advancing medical visual question-answering
    Mudgal, Anjali
    Kush, Udbhav
    Kumar, Aditya
    Jafari, Amir
    [J]. NEURAL COMPUTING AND APPLICATIONS, 2024, 36 (33) : 20949 - 20962
  • [46] Multimodal Local Perception Bilinear Pooling for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    [J]. IEEE ACCESS, 2018, 6 : 57923 - 57932
  • [47] Dual-Key Multimodal Backdoors for Visual Question Answering
    Walmer, Matthew
    Sikka, Karan
    Sur, Indranil
    Shrivastava, Abhinav
    Jha, Susmit
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15354 - 15364
  • [48] From Pixels to Objects: Cubic Visual Attention for Visual Question Answering
    Song, Jingkuan
    Zeng, Pengpeng
    Gao, Lianli
    Shen, Heng Tao
    [J]. PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 906 - 912
  • [49] Contrastive training of a multimodal encoder for medical visual question answering
    Silva, Joao Daniel
    Martins, Bruno
    Magalhaes, Joao
    [J]. INTELLIGENT SYSTEMS WITH APPLICATIONS, 2023, 18
  • [50] Multimodal Graph Networks for Compositional Generalization in Visual Question Answering
    Saqur, Raeid
    Narasimhan, Karthik
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33