Multimodal Attention for Visual Question Answering

Cited by: 0
Authors
Kodra, Lorena [1 ]
Mece, Elinda Kajo [1 ]
Affiliations
[1] Polytech Univ Tirana, Tirana, Albania
Keywords
Visual Question Answering (VQA); Multimodal attention mechanism; Convolutional Neural Networks (CNN); Recurrent Neural Networks (RNN); Long Short-Term Memory (LSTM)
DOI
10.1007/978-3-030-01174-1_60
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Visual Question Answering (VQA) is the task of providing an accurate natural-language answer to a natural-language question about a given image. In recent years, much work has been done in this area to address the challenges the task presents and to improve model accuracy. One recently introduced concept is the attention mechanism, whereby the model focuses on specific parts of the input in order to generate the answer. In this paper, we present a novel LSTM architecture for VQA that uses multimodal attention to focus on specific parts of the image as well as on specific words of the question in order to generate a more precise answer. We evaluate our proposed solution on the VQA dataset and show that it performs better than state-of-the-art models.
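The abstract describes attending jointly over image regions and question words, conditioned on the decoder state. The following is a minimal NumPy sketch of that idea as generic soft attention; the function name, projection matrices `Wv`/`Wq`, and all shapes are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_attention(h, V, Q, Wv, Wq):
    """Soft attention over image-region features V (k x d) and question-word
    features Q (n x d), conditioned on an LSTM hidden state h (d,).
    Wv and Wq are learned (d x d) projections; here they are placeholders."""
    alpha_v = softmax(V @ Wv @ h)   # (k,) weights over image regions
    alpha_q = softmax(Q @ Wq @ h)   # (n,) weights over question words
    v_att = alpha_v @ V             # (d,) attended visual context
    q_att = alpha_q @ Q             # (d,) attended textual context
    return np.concatenate([v_att, q_att])  # (2d,) fused multimodal context

# Toy example with random features in place of CNN / word embeddings.
rng = np.random.default_rng(0)
d, k, n = 8, 5, 4
ctx = multimodal_attention(rng.standard_normal(d),
                           rng.standard_normal((k, d)),
                           rng.standard_normal((n, d)),
                           rng.standard_normal((d, d)),
                           rng.standard_normal((d, d)))
print(ctx.shape)  # prints (16,)
```

In a full model the fused context vector would feed the answer classifier; here it simply demonstrates how the two modalities are weighted and combined.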
Pages: 783-792
Page count: 10
Related papers
50 records in total
  • [41] MedFuseNet: An attention-based multimodal deep learning model for visual question answering in the medical domain
    Sharma, Dhruv
    Purushotham, Sanjay
    Reddy, Chandan K.
    [J]. SCIENTIFIC REPORTS, 2021, 11 (01)
  • [42] Integrating multimodal features by a two-way co-attention mechanism for visual question answering
    Sharma, Himanshu
    Srivastava, Swati
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (21) : 59577 - 59595
  • [44] Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering
    Li, Yong
    Yang, Qihao
    Wang, Fu Lee
    Lee, Lap-Kei
    Qu, Yingying
    Hao, Tianyong
    [J]. ARTIFICIAL INTELLIGENCE IN MEDICINE, 2023, 144
  • [45] Multimodal fusion: advancing medical visual question-answering
    Mudgal, Anjali
    Kush, Udbhav
    Kumar, Aditya
    Jafari, Amir
    [J]. NEURAL COMPUTING AND APPLICATIONS, 2024, 36 (33) : 20949 - 20962
  • [46] Multimodal Local Perception Bilinear Pooling for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    [J]. IEEE ACCESS, 2018, 6 : 57923 - 57932
  • [47] Dual-Key Multimodal Backdoors for Visual Question Answering
    Walmer, Matthew
    Sikka, Karan
    Sur, Indranil
    Shrivastava, Abhinav
    Jha, Susmit
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15354 - 15364
  • [48] From Pixels to Objects: Cubic Visual Attention for Visual Question Answering
    Song, Jingkuan
    Zeng, Pengpeng
    Gao, Lianli
    Shen, Heng Tao
    [J]. PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 906 - 912
  • [49] Contrastive training of a multimodal encoder for medical visual question answering
    Silva, Joao Daniel
    Martins, Bruno
    Magalhaes, Joao
    [J]. INTELLIGENT SYSTEMS WITH APPLICATIONS, 2023, 18
  • [50] Multimodal Graph Networks for Compositional Generalization in Visual Question Answering
    Saqur, Raeid
    Narasimhan, Karthik
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33