Anatomical Structure-Guided Medical Vision-Language Pre-training

Cited: 0
Authors
Li, Qingqiu [1 ]
Yan, Xiaohan [2 ]
Xu, Jilan [3 ]
Yuan, Runtian [3 ]
Zhang, Yuejie [3 ]
Feng, Rui [1 ,3 ]
Shen, Quanli [4 ]
Zhang, Xiaobo [4 ]
Wang, Shujun [5 ,6 ]
Affiliations
[1] Fudan Univ, Sch Acad Engn & Technol, Shanghai, Peoples R China
[2] Tongji Univ, CAD Res Ctr, Shanghai, Peoples R China
[3] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[4] Fudan Univ, Childrens Hosp, Natl Childrens Med Ctr, Shanghai, Peoples R China
[5] Hong Kong Polytech Univ, Dept Biomed Engn, Hong Kong, Peoples R China
[6] Hong Kong Polytech Univ, Res Inst Smart Ageing, Hong Kong, Peoples R China
Keywords
Representation Learning; Medical Vision-Language Pre-training; Contrastive Learning; Anatomical Structure;
DOI
10.1007/978-3-031-72120-5_8
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Learning medical visual representations through vision-language pre-training has made remarkable progress. Despite the promising performance, it still faces two challenges: local alignment lacks interpretability and clinical relevance, and representation learning both within and across image-report pairs remains insufficient. To address these issues, we propose an Anatomical Structure-Guided (ASG) framework. Specifically, we parse raw reports into triplets <anatomical region, finding, existence> and fully utilize each element as supervision to enhance representation learning. For anatomical regions, we design an automatic anatomical region-sentence alignment paradigm in collaboration with radiologists, treating regions as the minimum semantic units for exploring fine-grained local alignment. For findings and existence, we regard them as image tags, applying an image-tag recognition decoder to associate image features with their respective tags within each sample, and constructing soft labels for contrastive learning to improve the semantic association across different image-report pairs. We evaluate the proposed ASG framework on two downstream tasks covering five public benchmarks. Experimental results demonstrate that our method outperforms state-of-the-art methods. Our code is available at https://asgmvlp.github.io.
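The soft-label contrastive idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Jaccard-overlap soft targets, the `temperature` value, and the symmetric cross-entropy form are all assumptions chosen to show how tag overlap can relax the usual one-hot image-report matching.

```python
import numpy as np

def soft_label_contrastive_loss(img_emb, txt_emb, tag_sets, temperature=0.07):
    """Image-report contrastive loss with soft targets from tag overlap.

    Instead of a strict one-hot target (each image matches only its own
    report), pairs whose reports share finding/existence tags receive
    partial credit, softening the semantic association across samples.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    img_emb, txt_emb = normalize(img_emb), normalize(txt_emb)
    logits = img_emb @ txt_emb.T / temperature  # (B, B) similarity matrix

    # Soft targets: Jaccard overlap of tag sets, row-normalized.
    n = len(tag_sets)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = tag_sets[i] | tag_sets[j]
            sim[i, j] = len(tag_sets[i] & tag_sets[j]) / len(union) if union else 1.0
    targets = sim / sim.sum(axis=1, keepdims=True)

    def log_softmax(z):
        # Subtract the row max for numerical stability before exponentiating.
        z = z - z.max(axis=1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

    # Symmetric cross-entropy against the soft targets (image-to-text
    # and text-to-image directions averaged).
    loss_i2t = -(targets * log_softmax(logits)).sum(axis=1).mean()
    loss_t2i = -(targets * log_softmax(logits.T)).sum(axis=1).mean()
    return (loss_i2t + loss_t2i) / 2
```

With identical tag sets the target row is uniform over matching samples, so two reports describing the same finding are no longer pushed apart as hard negatives.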
Pages: 80-90
Page count: 11