Medical visual question answering with symmetric interaction attention and cross-modal gating

Cited by: 0

Authors:

Chen, Zhi [1]

Zou, Beiji [1]

Dai, Yulan [1]

Zhu, Chengzhang [1]

Kong, Guilan [2]

Zhang, Wensheng [3]

Affiliations:

[1] Cent South Univ, Sch Comp Sci & Engn, Changsha 410083, Peoples R China

[2] Peking Univ, Natl Inst Hlth Data Sci, Beijing 100871, Peoples R China

[3] Chinese Acad Sci, Inst Automat, Beijing 100190, Peoples R China

Source:

BIOMEDICAL SIGNAL PROCESSING AND CONTROL | 2023 / Volume 85

Keywords:

Medical visual question answering; Self-attention; Information interaction; Cross-modal gating;

DOI:

10.1016/j.bspc.2023.105049

Chinese Library Classification (CLC) number:

R318 [Biomedical Engineering];

Discipline classification code:

0831;

Abstract:

The purpose of medical visual question answering (Med-VQA) is to provide accurate answers to clinical questions about the visual content of medical images. However, previous approaches do not take full advantage of the information interaction between medical images and clinical questions, which hinders further progress in Med-VQA. Addressing this issue requires focusing on critical information interaction within each modality and relevant information interaction between modalities. In this paper, we use the multiple meta-model quantifying model as the visual encoder and GloVe word embeddings followed by an LSTM as the textual encoder to form our feature extraction module. We then design a symmetric interaction attention module to construct dense and deep intra- and inter-modal information interaction on medical images and clinical questions for the Med-VQA task. Specifically, the symmetric interaction attention module consists of multiple symmetric interaction attention blocks, each containing two basic units: self-attention and interaction attention. Self-attention is introduced for intra-modal information interaction, while interaction attention is constructed for inter-modal information interaction. In addition, we develop a multi-modal fusion scheme that uses cross-modal gating to fuse multi-modal information effectively and avoid redundant information after sufficient intra- and inter-modal information interaction. Experimental results on the VQA-RAD and PathVQA datasets show that our method outperforms other state-of-the-art Med-VQA models, achieving accuracies of 74.7% and 48.7% and F1-scores of 73.5% and 46.0%, respectively.
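The abstract names three building blocks: self-attention within each modality, interaction attention between modalities, and a gated fusion step. The NumPy sketch below illustrates how such pieces could fit together; it is not the authors' implementation. It assumes plain scaled dot-product attention with no learned projections, heads, or normalization, mean pooling before fusion, and a sigmoid gate of the form `g * v + (1 - g) * q`. All function names, shapes, and the random weights `w_v` and `w_q` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, context):
    """Scaled dot-product attention; context serves as both keys and values.

    Assumption: no learned query/key/value projections, for brevity.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context


def symmetric_block(v, q):
    """One illustrative block: self-attention, then interaction attention."""
    # Intra-modal interaction: each modality attends to itself.
    v = v + attention(v, v)
    q = q + attention(q, q)
    # Inter-modal interaction, applied symmetrically in both directions.
    v_out = v + attention(v, q)
    q_out = q + attention(q, v)
    return v_out, q_out


def cross_modal_gate(v, q, w_v, w_q):
    """Assumed gating form: a sigmoid gate mixes the pooled modalities."""
    v_pool = v.mean(axis=0)
    q_pool = q.mean(axis=0)
    gate = 1.0 / (1.0 + np.exp(-(v_pool @ w_v + q_pool @ w_q)))
    return gate * v_pool + (1.0 - gate) * q_pool


rng = np.random.default_rng(0)
d = 8                              # hypothetical shared feature size
v = rng.normal(size=(5, d))        # 5 visual feature vectors
q = rng.normal(size=(12, d))       # 12 question token features
for _ in range(2):                 # stack two blocks
    v, q = symmetric_block(v, q)
fused = cross_modal_gate(v, q, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(fused.shape)  # (8,)
```

In this sketch the gate is computed per feature dimension, so each element of the fused vector is a convex combination of the pooled visual and textual features. A trained model would learn the gate weights rather than sample them.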


Pages: 10
