Medical visual question answering with symmetric interaction attention and cross-modal gating

Citations: 0
Authors
Chen, Zhi [1 ]
Zou, Beiji [1 ]
Dai, Yulan [1 ]
Zhu, Chengzhang [1 ]
Kong, Guilan [2 ]
Zhang, Wensheng [3 ]
Affiliations
[1] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Peoples R China
[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100871, Peoples R China
[3] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
Keywords
Medical visual question answering; Self-attention; Information interaction; Cross-modal gating;
DOI
10.1016/j.bspc.2023.105049
Chinese Library Classification (CLC)
R318 [Biomedical Engineering];
Discipline Classification Code
0831;
Abstract
The purpose of medical visual question answering (Med-VQA) is to provide accurate answers to clinical questions about the visual content of medical images. However, previous attempts fail to take full advantage of the information interaction between medical images and clinical questions, which hinders further progress in Med-VQA. Addressing this issue requires focusing on critical information interaction within each modality and relevant information interaction between modalities. In this paper, we use the multiple meta-model quantifying model as the visual encoder and GloVe word embeddings followed by an LSTM as the textual encoder to form our feature extraction module. We then design a symmetric interaction attention module that constructs dense and deep intra- and inter-modal information interaction between medical images and clinical questions for the Med-VQA task. Specifically, the symmetric interaction attention module consists of multiple symmetric interaction attention blocks, each containing two basic units: self-attention and interaction attention. Self-attention is introduced for intra-modal information interaction, while interaction attention is constructed for inter-modal information interaction. In addition, we develop a multi-modal fusion scheme that leverages cross-modal gating to effectively fuse multi-modal information and avoid redundancy after sufficient intra- and inter-modal information interaction. Experimental results on the VQA-RAD and PathVQA datasets show that our method outperforms other state-of-the-art Med-VQA models, achieving 74.7% and 48.7% in accuracy and 73.5% and 46.0% in F1-score, respectively.
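To make the described design concrete, the following is a minimal PyTorch-style sketch (not the authors' released implementation) of the two components named in the abstract: a symmetric interaction attention block that applies self-attention within each modality and interaction (cross-) attention between the visual and textual streams, followed by cross-modal gated fusion of the pooled features. The feature extractors (MMQ visual encoder, GloVe + LSTM) are omitted, and all dimensions, head counts, class names, and the exact gating form are illustrative assumptions.

# Minimal sketch of symmetric interaction attention + cross-modal gated fusion.
# Assumptions (not from the paper): 512-d features, 8 heads, sigmoid gating on
# mean-pooled features, and post-norm residual connections.
import torch
import torch.nn as nn


class SymmetricInteractionBlock(nn.Module):
    """Self-attention (intra-modal) followed by interaction attention
    (inter-modal), applied symmetrically to the visual and textual streams."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v1, self.norm_v2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm_q1, self.norm_q2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, v: torch.Tensor, q: torch.Tensor):
        # Intra-modal information interaction: each modality attends to itself.
        v = self.norm_v1(v + self.self_attn_v(v, v, v)[0])
        q = self.norm_q1(q + self.self_attn_q(q, q, q)[0])
        # Inter-modal information interaction: each modality queries the other.
        v = self.norm_v2(v + self.cross_attn_v(v, q, q)[0])
        q = self.norm_q2(q + self.cross_attn_q(q, v, v)[0])
        return v, q


class CrossModalGatedFusion(nn.Module):
    """Gate each modality's pooled feature with a signal computed from both,
    suppressing redundant information before fusion."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate_v = nn.Linear(2 * dim, dim)
        self.gate_q = nn.Linear(2 * dim, dim)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        v_pool, q_pool = v.mean(dim=1), q.mean(dim=1)   # pool over tokens
        joint = torch.cat([v_pool, q_pool], dim=-1)
        g_v = torch.sigmoid(self.gate_v(joint))          # visual gate
        g_q = torch.sigmoid(self.gate_q(joint))          # textual gate
        return g_v * v_pool + g_q * q_pool                # fused representation


if __name__ == "__main__":
    block, fusion = SymmetricInteractionBlock(), CrossModalGatedFusion()
    img_feats = torch.randn(2, 49, 512)   # e.g. 7x7 grid of visual features
    txt_feats = torch.randn(2, 12, 512)   # e.g. 12 question-token features
    v, q = block(img_feats, txt_feats)
    fused = fusion(v, q)
    print(fused.shape)                     # torch.Size([2, 512])

In a full model, several such blocks would be stacked before fusion, and the fused vector would feed an answer classifier; those details are beyond this sketch.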
Pages: 10
Related Papers
50 records in total
  • [1] Visual question answering with attention transfer and a cross-modal gating mechanism
    Li, Wei
    Sun, Jianhui
    Liu, Ge
    Zhao, Linglan
    Fang, Xiangzhong
    [J]. PATTERN RECOGNITION LETTERS, 2020, 133 : 334 - 340
  • [2] Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering
    Li, Yong
    Yang, Qihao
    Wang, Fu Lee
    Lee, Lap-Kei
    Qu, Yingying
    Hao, Tianyong
    [J]. ARTIFICIAL INTELLIGENCE IN MEDICINE, 2023, 144
  • [3] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    [J]. IEEE ACCESS, 2018, 6 : 31516 - 31524
  • [4] Cross-Modal Visual Question Answering for Remote Sensing Data
    Felix, Rafael
    Repasky, Boris
    Hodge, Samuel
    Zolfaghari, Reza
    Abbasnejad, Ehsan
    Sherrah, Jamie
    [J]. 2021 INTERNATIONAL CONFERENCE ON DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA 2021), 2021, : 57 - 65
  • [5] Cross-modal Relational Reasoning Network for Visual Question Answering
    Chen, Hongyu
    Liu, Ruifang
    Peng, Bo
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3939 - 3948
  • [6] Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering
    Gong, Haifan
    Chen, Guanqi
    Liu, Sishuo
    Yu, Yizhou
    Li, Guanbin
    [J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 456 - 460
  • [7] Cross-Modal Retrieval for Knowledge-Based Visual Question Answering
    Lerner, Paul
    Ferret, Olivier
    Guinaudeau, Camille
    [J]. ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 421 - 438
  • [8] Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval
    Yu, Jing
    Zhang, Weifeng
    Lu, Yuhang
    Qin, Zengchang
    Hu, Yue
    Tan, Jianlong
    Wu, Qi
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (12) : 3196 - 3209
  • [9] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing
    Zhu, Zihao
    Wang, Yujing
    Zhang, Weifeng
    Hu, Yue
    Tan, Jianlong
    [J]. PATTERN RECOGNITION, 2020, 108
  • [10] Cross-Modal Dense Passage Retrieval for Outside Knowledge Visual Question Answering
    Reichman, Benjamin
    Heck, Larry
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 2829 - 2834