Medical visual question answering with symmetric interaction attention and cross-modal gating

Times Cited: 0
Authors
Chen, Zhi [1 ]
Zou, Beiji [1 ]
Dai, Yulan [1 ]
Zhu, Chengzhang [1 ]
Kong, Guilan [2 ]
Zhang, Wensheng [3 ]
Affiliations
[1] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Peoples R China
[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100871, Peoples R China
[3] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China
Keywords
Medical visual question answering; Self-attention; Information interaction; Cross-modal gating
DOI
10.1016/j.bspc.2023.105049
Chinese Library Classification
R318 [Biomedical Engineering]
Discipline Code
0831
Abstract
The purpose of medical visual question answering (Med-VQA) is to provide accurate answers to clinical questions about the visual content of medical images. However, previous attempts fail to take full advantage of the information interaction between medical images and clinical questions, which hinders further progress in Med-VQA. Addressing this issue requires focusing on critical information interaction within each modality and relevant information interaction between modalities. In this paper, we use the multiple meta-model quantifying model as the visual encoder and GloVe word embeddings followed by an LSTM as the textual encoder to form our feature extraction module. We then design a symmetric interaction attention module that constructs dense and deep intra- and inter-modal information interaction between medical images and clinical questions for the Med-VQA task. Specifically, the symmetric interaction attention module consists of multiple symmetric interaction attention blocks, each containing two basic units: self-attention and interaction attention. Self-attention is introduced for intra-modal information interaction, while interaction attention is constructed for inter-modal information interaction. In addition, we develop a multi-modal fusion scheme that leverages cross-modal gating to fuse multi-modal information effectively and avoid redundancy after sufficient intra- and inter-modal information interaction. Experimental results on the VQA-RAD and PathVQA datasets show that our method outperforms other state-of-the-art Med-VQA models, achieving accuracies of 74.7% and 48.7% and F1-scores of 73.5% and 46.0%, respectively.
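As a schematic illustration of the mechanisms the abstract describes, the NumPy sketch below shows one symmetric interaction attention block (self-attention for intra-modal interaction, interaction attention for inter-modal interaction) followed by a sigmoid cross-modal gate for fusion. The function names, the omission of learned query/key/value projections and multi-head splitting, and the concrete gating formula are simplifying assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

def symmetric_interaction_block(img, txt):
    """One symmetric interaction attention block (sketch): self-attention
    within each modality, then interaction attention across modalities."""
    img_sa = attention(img, img, img)            # intra-modal: visual
    txt_sa = attention(txt, txt, txt)            # intra-modal: textual
    img_ia = attention(img_sa, txt_sa, txt_sa)   # image attends to question
    txt_ia = attention(txt_sa, img_sa, img_sa)   # question attends to image
    return img_ia, txt_ia

def cross_modal_gating(img_vec, txt_vec, Wg):
    """Gated fusion (sketch): a sigmoid gate computed from both modalities
    weights their contributions, suppressing redundant information."""
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([img_vec, txt_vec]) @ Wg)))
    return gate * img_vec + (1.0 - gate) * txt_vec

# Toy usage: 5 image regions and 7 question tokens, feature dim 8.
rng = np.random.default_rng(0)
img = rng.normal(size=(5, 8))
txt = rng.normal(size=(7, 8))
img_out, txt_out = symmetric_interaction_block(img, txt)
fused = cross_modal_gating(img_out.mean(axis=0), txt_out.mean(axis=0),
                           rng.normal(size=(16, 8)))
```

In the paper, several such blocks are stacked so the intra- and inter-modal interactions become dense and deep; the gate is what prevents the two streams from contributing redundant information after that interaction.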
Pages: 10