Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training

Cited by: 39
Authors
Chen, Zhihong [1]
Du, Yuhao [1]
Hu, Jinpeng [1]
Liu, Yang [1]
Li, Guanbin [2]
Wan, Xiang [1,3]
Chang, Tsung-Hui [1]
Affiliations
[1] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Shenzhen, Peoples R China
[2] Sun Yat Sen Univ, Guangzhou, Peoples R China
[3] Pazhou Lab, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multi-modal pre-training; Masked autoencoders; Medical vision-and-language analysis;
DOI
10.1007/978-3-031-16443-9_65
CLC Number
TP39 [Computer Applications]
Discipline Codes
081203; 0835
Abstract
Medical vision-and-language pre-training provides a feasible solution for extracting effective vision-and-language representations from medical images and texts. However, few studies have been dedicated to this field to facilitate medical vision-and-language understanding. In this paper, we propose a self-supervised learning paradigm with multi-modal masked autoencoders (M3AE), which learns cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts. Three key designs make this simple approach work. First, considering the different information densities of vision and language, we adopt different masking ratios for the input image and text, with a considerably larger ratio used for images. Second, we use visual and textual features from different layers to perform the reconstruction, accounting for the different levels of abstraction in vision and language. Third, we develop different decoder designs for vision and language (i.e., a Transformer for vision and a multi-layer perceptron for language). To enable a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark comprising three tasks. Experimental results demonstrate the effectiveness of our approach, which achieves state-of-the-art results on all downstream tasks. In addition, further analyses verify the effectiveness of the different components of our approach and of various pre-training settings. The source code is available at https://github.com/zhjohnchan/M3AE.
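To make the objective concrete, the following is a minimal PyTorch sketch of the idea only, not the authors' released implementation (see https://github.com/zhjohnchan/M3AE for the real code): the M3AESketch class, the module sizes, the 0.75/0.15 masking ratios, and the mask-token masking scheme are illustrative assumptions, and the multi-layer feature selection mentioned in the abstract is omitted for brevity.

# A minimal, self-contained sketch of the pre-training objective summarized
# above. NOT the authors' released code: sizes, ratios, and the mask-token
# scheme are illustrative assumptions.
import torch
import torch.nn as nn

class M3AESketch(nn.Module):
    def __init__(self, patch_dim=768, vocab_size=30522, dim=256,
                 img_mask_ratio=0.75, txt_mask_ratio=0.15):
        super().__init__()
        # Design 1: a much larger masking ratio for images than for text,
        # reflecting the lower information density of pixels versus words.
        self.img_mask_ratio = img_mask_ratio
        self.txt_mask_ratio = txt_mask_ratio
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.mask_embed = nn.Parameter(torch.zeros(1, 1, dim))
        # One joint encoder over the concatenated (masked) image/text tokens.
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=6)
        # Design 3: asymmetric decoders -- a Transformer reconstructs pixels,
        # a multi-layer perceptron predicts the masked words.
        dec = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.vision_decoder = nn.TransformerEncoder(dec, num_layers=2)
        self.pixel_head = nn.Linear(dim, patch_dim)
        self.language_head = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, vocab_size))

    def _mask(self, x, ratio):
        # Replace a random `ratio` of positions with the learned mask embedding.
        mask = torch.rand(x.shape[:2], device=x.device) < ratio  # (B, N) bool
        return torch.where(mask.unsqueeze(-1), self.mask_embed, x), mask

    def forward(self, patches, token_ids):
        # patches: (B, Nv, patch_dim) flattened image patches
        # token_ids: (B, Nt) tokenized report
        v, v_mask = self._mask(self.patch_embed(patches), self.img_mask_ratio)
        t, t_mask = self._mask(self.token_embed(token_ids), self.txt_mask_ratio)
        h = self.encoder(torch.cat([v, t], dim=1))
        h_v, h_t = h[:, :v.size(1)], h[:, v.size(1):]
        pixel_pred = self.pixel_head(self.vision_decoder(h_v))
        token_logits = self.language_head(h_t)
        return pixel_pred, token_logits, v_mask, t_mask

# Usage: reconstruction losses are computed only at masked positions.
model = M3AESketch()
patches = torch.randn(2, 196, 768)         # e.g. 14x14 patches of a 224px X-ray
tokens = torch.randint(0, 30522, (2, 64))  # tokenized radiology report
pix, logits, vm, tm = model(patches, tokens)
img_loss = ((pix - patches) ** 2)[vm].mean()
txt_loss = nn.functional.cross_entropy(logits[tm], tokens[tm])
loss = img_loss + txt_loss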
Pages: 679-689
Page count: 11
Related Papers
50 records in total (first 10 shown)
  • [1] Lu, Mengkang; Wang, Tianyi; Xia, Yong. Multi-modal Pathological Pre-training via Masked Autoencoders for Breast Cancer Diagnosis. Medical Image Computing and Computer Assisted Intervention, MICCAI 2023, Pt VI, 2023, 14225: 457-466.
  • [2] Yu, Zheng; Qiao, Yanyuan; Xie, Yutong; Wu, Qi. Multi-modal Adapter for Medical Vision-and-Language Learning. Machine Learning in Medical Imaging, MLMI 2023, Pt I, 2024, 14348: 393-402.
  • [3] Moon, Jong Hak; Lee, Hyungyung; Shin, Woncheol; Kim, Young-Hak; Choi, Edward. Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training. IEEE Journal of Biomedical and Health Informatics, 2022, 26(12): 6070-6080.
  • [4] Wu, Siying; Fu, Xueyang; Wu, Feng; Zha, Zheng-Jun. Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation. Proceedings of the 30th ACM International Conference on Multimedia, MM 2022, 2022: 4233-4241.
  • [5] Yan, Zhiqiang; Li, Xiang; Wang, Kun; Zhang, Zhenyu; Li, Jun; Yang, Jian. Multi-modal Masked Pre-training for Monocular Panoramic Depth Completion. Computer Vision - ECCV 2022, Pt I, 2022, 13661: 378-395.
  • [6] Xu, Li; Liu, Bo; Khan, Ameer Hamza; Fan, Lu; Wu, Xiao-Ming. Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark. Conference on Health, Inference, and Learning, Vol 209, 2023, 209: 117+.
  • [7] Ding, Ruixue; Chen, Boli; Xie, Pengjun; Huang, Fei; Li, Xin; Zhang, Qiang; Xu, Yao. MGeo: Multi-Modal Geographic Language Model Pre-Training. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, 2023: 185-194.
  • [8] Qi, Qiaosong; Zhang, Aixi; Liao, Yue; Sun, Wenyu; Wang, Yongliang; Li, Xiaobo; Liu, Si. Simultaneously Training and Compressing Vision-and-Language Pre-Training Model. IEEE Transactions on Multimedia, 2023, 25: 8194-8203.
  • [9] Chen, Zhihong; Li, Guanbin; Wan, Xiang. Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge. Proceedings of the 30th ACM International Conference on Multimedia, MM 2022, 2022: 5152-5161.
  • [10] Chen, Zhihong; Diao, Shizhe; Wang, Benyou; Li, Guanbin; Wan, Xiang. Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts. 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023), 2023: 23346-23356.