Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training

Cited by: 39

Authors:

Chen, Zhihong [1]

Du, Yuhao [1]

Hu, Jinpeng [1]

Liu, Yang [1]

Li, Guanbin [2]

Wan, Xiang [1,3]

Chang, Tsung-Hui [1]

Affiliations:

[1] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Shenzhen, Peoples R China

[2] Sun Yat Sen Univ, Guangzhou, Peoples R China

[3] Pazhou Lab, Guangzhou, Peoples R China

Source:

MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT V | 2022 / Vol. 13435

Funding:

National Natural Science Foundation of China;

Keywords:

Multi-modal pre-training; Masked autoencoders; Medical vision-and-language analysis;

DOI:

10.1007/978-3-031-16443-9_65

Chinese Library Classification number:

TP39 [Computer applications];

Discipline classification codes:

081203 ; 0835 ;

Abstract:

Medical vision-and-language pre-training provides a feasible solution to extract effective vision-and-language representations from medical images and texts. However, few studies have been dedicated to this field to facilitate medical vision-and-language understanding. In this paper, we propose a self-supervised learning paradigm with multi-modal masked autoencoders (M³AE), which learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts. Three key designs make this simple approach work. First, considering the different information densities of vision and language, we adopt different masking ratios for the input image and text, where a considerably larger masking ratio is used for images. Second, we use visual and textual features from different layers to perform the reconstruction, to deal with the different levels of abstraction in vision and language. Third, we develop different designs for the vision and language decoders (i.e., a Transformer for vision and a multi-layer perceptron for language). To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results demonstrate the effectiveness of our approach, with state-of-the-art results achieved on all downstream tasks. We also conduct further analysis to verify the effectiveness of the different components of our approach and of various pre-training settings. The source code is available at https://github.com/zhjohnchan/M3AE.
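The first design choice in the abstract, masking the image far more heavily than the text, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function name `random_mask`, the toy inputs, and both ratio values are assumptions chosen for illustration, since the abstract states only that the image ratio is considerably larger. For simplicity, masked patches are replaced with a placeholder here, whereas a masked-autoencoder encoder would typically drop them.

```python
import random


def random_mask(items, mask_ratio, mask_value, rng):
    """Replace a random subset of `items` with `mask_value`.

    Returns the masked copy and the sorted indices that were masked.
    """
    num_masked = int(round(len(items) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(items)), num_masked))
    masked = list(items)
    for i in masked_idx:
        masked[i] = mask_value
    return masked, masked_idx


# Hypothetical inputs: 16 image patches (by id) and an 8-token report snippet.
rng = random.Random(0)
patches = [f"patch_{i}" for i in range(16)]
tokens = ["no", "acute", "cardiopulmonary", "abnormality", "is", "seen", "today", "."]

# Assumed ratios, for illustration only: much larger for the image than the text.
IMAGE_MASK_RATIO = 0.75
TEXT_MASK_RATIO = 0.25

masked_patches, patch_idx = random_mask(patches, IMAGE_MASK_RATIO, None, rng)
masked_tokens, token_idx = random_mask(tokens, TEXT_MASK_RATIO, "[MASK]", rng)

# The reconstruction targets are the original values at the masked positions.
patch_targets = [patches[i] for i in patch_idx]
token_targets = [tokens[i] for i in token_idx]
```

With these assumed ratios, 12 of the 16 patches and 2 of the 8 tokens are hidden, so the pixel-reconstruction task sees far less of its input than the token-reconstruction task does.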

Citation

Pages: 679-689

Page count: 11

50 items in total

[1] Multi-modal Pathological Pre-training via Masked Autoencoders for Breast Cancer Diagnosis
Lu, Mengkang
Wang, Tianyi
Xia, Yong
[J]. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT VI, 2023, 14225 : 457 - 466
[2] Multi-modal Adapter for Medical Vision-and-Language Learning
Yu, Zheng
Qiao, Yanyuan
Xie, Yutong
Wu, Qi
[J]. MACHINE LEARNING IN MEDICAL IMAGING, MLMI 2023, PT I, 2024, 14348 : 393 - 402
[3] Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training
Moon, Jong Hak
Lee, Hyungyung
Shin, Woncheol
Kim, Young-Hak
Choi, Edward
[J]. IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, 2022, 26 (12) : 6070 - 6080
[4] Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation
Wu, Siying
Fu, Xueyang
Wu, Feng
Zha, Zheng-Jun
[J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4233 - 4241
[5] Multi-modal Masked Pre-training for Monocular Panoramic Depth Completion
Yan, Zhiqiang
Li, Xiang
Wang, Kun
Zhang, Zhenyu
Li, Jun
Yang, Jian
[J]. COMPUTER VISION - ECCV 2022, PT I, 2022, 13661 : 378 - 395
[6] Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
Xu, Li
Liu, Bo
Khan, Ameer Hamza
Fan, Lu
Wu, Xiao-Ming
[J]. CONFERENCE ON HEALTH, INFERENCE, AND LEARNING, VOL 209, 2023, 209 : 117 - +
[7] MGeo: Multi-Modal Geographic Language Model Pre-Training
Ding, Ruixue
Chen, Boli
Xie, Pengjun
Huang, Fei
Li, Xin
Zhang, Qiang
Xu, Yao
[J]. PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 185 - 194
[8] Simultaneously Training and Compressing Vision-and-Language Pre-Training Model
Qi, Qiaosong
Zhang, Aixi
Liao, Yue
Sun, Wenyu
Wang, Yongliang
Li, Xiaobo
Liu, Si
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8194 - 8203
[9] Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge
Chen, Zhihong
Li, Guanbin
Wan, Xiang
[J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5152 - 5161
[10] Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts
Chen, Zhihong
Diao, Shizhe
Wang, Benyou
Li, Guanbin
Wan, Xiang
[J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 23346 - 23356
