Parallel multi-head attention and term-weighted question embedding for medical visual question answering

Cited by: 2
Authors
Manmadhan, Sruthy [1 ,2 ]
Kovoor, Binsu C. [1 ]
Affiliations
[1] Cochin Univ Sci & Technol, Div Informat Technol, Kochi 682022, Kerala, India
[2] NSS Coll Engn, Dept Comp Sci & Engn, Palakkad 678008, Kerala, India
Keywords
Multi-head attention; Denoising autoencoder; Radiology images; Supervised term weighting; Visual question answering; VQA-RAD;
DOI
10.1007/s11042-023-14981-2
Chinese Library Classification
TP [Automation and Computer Technology]
Discipline Code
0812
Abstract
The goal of medical visual question answering (Med-VQA) is to correctly answer a clinical question posed about a medical image. Medical images differ fundamentally from general-domain images, so general-domain Visual Question Answering (VQA) models cannot be applied directly to the medical domain. Furthermore, the large-scale data required by VQA models is rarely available in the medical arena. Existing Med-VQA approaches often rely on transfer learning with external data to obtain good image feature representations, and use cross-modal fusion of visual and language features to compensate for the lack of labelled data. This research presents a new parallel multi-head attention framework (MaMVQA) for Med-VQA that requires no external data. The proposed framework performs image feature extraction with an unsupervised Denoising Auto-Encoder (DAE) and language feature extraction with term-weighted question embedding. In addition, we present qf-MI, a novel supervised term-weighting (STW) scheme based on the mutual information (MI) between a word and the corresponding class label. Extensive experiments on the VQA-RAD public medical VQA benchmark show that the proposed methodology outperforms previous state-of-the-art methods in accuracy while requiring no external data to train the model. Remarkably, the presented MaMVQA model achieved significantly higher accuracy on both close-ended (78.68%) and open-ended (55.31%) questions. An extensive set of ablations demonstrates the significance of the system's individual components.
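The abstract describes qf-MI only as a supervised term-weighting scheme built on the mutual information between a word and its class label. A minimal sketch of that general idea is below; the exact qf-MI formula is not given in the abstract, so the choice here (corpus-level I(term; class) over binary term presence, with hypothetical helper name `qf_mi_weight`) is an illustrative assumption, not the paper's method.

```python
import math
from collections import Counter

def qf_mi_weight(questions, labels, term):
    """Weight a term by its mutual information with the answer-class label.

    Generic MI-based supervised term weighting; the paper's qf-MI scheme
    is not fully specified in the abstract, so this formulation is an
    illustrative assumption.
    """
    n = len(questions)
    joint = Counter()                      # counts of (term present?, class)
    for q, c in zip(questions, labels):
        joint[(term in q.lower().split(), c)] += 1
    marg_w, marg_c = Counter(), Counter()  # marginal counts of presence / class
    for (w, c), k in joint.items():
        marg_w[w] += k
        marg_c[c] += k
    # I(W; C) = sum over (w, c) of p(w, c) * log(p(w, c) / (p(w) * p(c)))
    return sum((k / n) * math.log(k * n / (marg_w[w] * marg_c[c]))
               for (w, c), k in joint.items())
```

Such a weight could then scale each word's embedding before pooling into the question representation, so that class-discriminative words dominate the term-weighted question embedding.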
Pages: 34937-34958 (22 pages)