Multimodal Attention for Visual Question Answering

Cited by: 0
Authors
Kodra, Lorena [1]
Mece, Elinda Kajo [1]
Institutions
[1] Polytech Univ Tirana, Tirana, Albania
Source
Keywords
Visual Question Answering (VQA); Multimodal attention mechanism; Convolutional Neural Networks (CNN); Recurrent Neural Networks (RNN); Long Short-Term Memory (LSTM)
DOI
10.1007/978-3-030-01174-1_60
CLC Classification Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Visual Question Answering (VQA) is the task of providing an accurate natural language answer, given an image and a natural language question about that image. In recent years, a great deal of work has addressed the challenges this task presents and improved the accuracy of VQA models. One recently introduced concept is the attention mechanism, whereby the model focuses on specific parts of the input in order to generate the answer. In this paper, we present a novel LSTM architecture for VQA that uses multimodal attention to focus on specific parts of the image as well as on specific words of the question, so as to generate a more precise answer. We evaluate our proposed solution on the VQA dataset and show that it performs better than state-of-the-art models.
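The multimodal attention the abstract describes, attending over CNN-derived image region features and over LSTM hidden states for the question words, can be sketched roughly as below. This is a minimal illustration, not the paper's exact formulation: the weight matrices `W_v` and `W_q`, the shared query vector, and fusion by concatenation are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_attention(region_feats, word_feats, query, W_v, W_q):
    """Attend jointly over image regions and question words.

    region_feats: (R, d_v) CNN features, one row per image region
    word_feats:   (T, d_w) LSTM hidden states, one row per question word
    query:        (d,)     joint state used to score both modalities
    W_v, W_q:     (d_v, d), (d_w, d) projection matrices (illustrative)
    """
    v_weights = softmax(region_feats @ W_v @ query)  # (R,) image attention
    q_weights = softmax(word_feats @ W_q @ query)    # (T,) word attention
    v_ctx = v_weights @ region_feats                 # attended image vector
    q_ctx = q_weights @ word_feats                   # attended question vector
    return np.concatenate([v_ctx, q_ctx])            # fused multimodal context
```

In a full model, the fused context vector would feed an answer classifier or decoder; here it simply shows how one query can weight both modalities at once.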
Pages: 783-792
Page count: 10
Related Papers
50 in total
  • [1] QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document
    Mahamoud, Ibrahim Souleiman
    Coustaty, Mickael
    Joseph, Aurelie
    d'Andecy, Vincent Poulain
    Ogier, Jean-Marc
    [J]. DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 659 - 673
  • [2] Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering
    Li, Haiyan
    Han, Dezhi
    [J]. COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2021, 18 (03) : 1023 - 1040
  • [3] Multimodal attention-driven visual question answering for Malayalam
    Abhishek Gopinath Kovath
    Anand Nayyar
    O. K. Sikha
    [J]. Neural Computing and Applications, 2024, 36 (24) : 14691 - 14708
  • [4] Multimodal Encoder-Decoder Attention Networks for Visual Question Answering
    Chen, Chongqing
    Han, Dezhi
    Wang, Jun
    [J]. IEEE ACCESS, 2020, 8 : 35662 - 35671
  • [5] Multimodal feature fusion by relational reasoning and attention for visual question answering
    Zhang, Weifeng
    Yu, Jing
    Hu, Hua
    Hu, Haiyang
    Qin, Zengchang
    [J]. INFORMATION FUSION, 2020, 55 : 116 - 126
  • [6] Multimodal Cross-guided Attention Networks for Visual Question Answering
    Liu, Haibin
    Gong, Shengrong
    Ji, Yi
    Yang, Jianyu
    Xing, Tengfei
    Liu, Chunping
    [J]. PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON COMPUTER MODELING, SIMULATION AND ALGORITHM (CMSA 2018), 2018, 151 : 347 - 353
  • [7] Multimodal Bi-direction Guided Attention Networks for Visual Question Answering
    Cai, Linqin
    Xu, Nuoying
    Tian, Hang
    Chen, Kejia
    Fan, Haodu
    [J]. NEURAL PROCESSING LETTERS, 2023, 55 (09) : 11921 - 11943
  • [8] An Improved Attention for Visual Question Answering
    Rahman, Tanzila
    Chou, Shih-Han
    Sigal, Leonid
    Carenini, Giuseppe
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 1653 - 1662
  • [9] Differential Attention for Visual Question Answering
    Patro, Badri
    Namboodiri, Vinay P.
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7680 - 7688