Anatomical Structure-Guided Medical Vision-Language Pre-training

Cited by: 0
Authors
Li, Qingqiu [1]
Yan, Xiaohan [2]
Xu, Jilan [3]
Yuan, Runtian [3]
Zhang, Yuejie [3]
Feng, Rui [1,3]
Shen, Quanli [4]
Zhang, Xiaobo [4]
Wang, Shujun [5,6]
Affiliations
[1] Fudan University, Academy for Engineering and Technology, Shanghai, China
[2] Tongji University, CAD Research Center, Shanghai, China
[3] Fudan University, School of Computer Science, Shanghai, China
[4] Children's Hospital of Fudan University, National Children's Medical Center, Shanghai, China
[5] The Hong Kong Polytechnic University, Department of Biomedical Engineering, Hong Kong, China
[6] The Hong Kong Polytechnic University, Research Institute for Smart Ageing, Hong Kong, China
Keywords
Representation Learning; Medical Vision-Language Pre-training; Contrastive Learning; Anatomical Structure
DOI
10.1007/978-3-031-72120-5_8
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Learning medical visual representations through vision-language pre-training has made remarkable progress. Despite this promising performance, two challenges remain: local alignment lacks interpretability and clinical relevance, and the internal and external representation learning of image-report pairs is insufficient. To address these issues, we propose an Anatomical Structure-Guided (ASG) framework. Specifically, we parse raw reports into triplets <anatomical region, finding, existence> and fully exploit each element as supervision to enhance representation learning. For anatomical regions, we design an automatic anatomical region-sentence alignment paradigm in collaboration with radiologists, treating regions as the minimum semantic units for fine-grained local alignment. For findings and existence, we regard them as image tags, applying an image-tag recognition decoder to associate image features with their respective tags within each sample, and constructing soft labels for contrastive learning to improve the semantic association across different image-report pairs. We evaluate the proposed ASG framework on two downstream tasks across five public benchmarks. Experimental results demonstrate that our method outperforms state-of-the-art methods. Our code is available at https://asgmvlp.github.io.
Pages: 80-90 (11 pages)
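The soft-label contrastive objective described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the tag-overlap weighting (a Jaccard-style similarity over multi-hot <finding, existence> tag vectors), the temperature value, and the symmetric image-to-text / text-to-image averaging are all assumptions about how such soft labels might be built.

```python
import numpy as np

def soft_label_contrastive_loss(img_emb, txt_emb, tags, temperature=0.07):
    """Illustrative soft-label image-report contrastive loss.

    img_emb, txt_emb: (N, D) embeddings of paired images and reports.
    tags: (N, T) multi-hot tag vectors (e.g. findings and their existence).
    Instead of treating all non-paired samples as pure negatives, pairs
    that share tags receive non-zero soft targets.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    def log_softmax(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

    # Cosine-similarity logits between every image and every report.
    logits = normalize(img_emb) @ normalize(txt_emb).T / temperature

    # Soft targets from tag overlap (Jaccard similarity), row-normalized.
    inter = tags @ tags.T
    union = tags.sum(1, keepdims=True) + tags.sum(1) - inter
    sim = inter / np.maximum(union, 1)
    np.fill_diagonal(sim, 1.0)  # the true pair always gets full weight
    targets = sim / sim.sum(1, keepdims=True)

    # Symmetric cross-entropy against the soft targets.
    loss_i2t = -(targets * log_softmax(logits)).sum(1).mean()
    loss_t2i = -(targets * log_softmax(logits.T)).sum(1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

The key difference from standard InfoNCE is the `targets` matrix: semantically related but unpaired samples are down-weighted as negatives rather than repelled outright, which is the behavior the abstract attributes to its soft labels.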