Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts

Times cited: 0

Authors:

Chen, Zhihong [1,2]

Diao, Shizhe [3]

Wang, Benyou [1,2]

Li, Guanbin [4]

Wan, Xiang [2]

Affiliations:

[1] Chinese Univ Hong Kong, Shenzhen, Peoples R China

[2] Shenzhen Res Inst Big Data, Shenzhen, Peoples R China

[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China

[4] Sun Yat Sen Univ, Guangzhou, Peoples R China

Source:

2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023) | 2023

Funding:

National Natural Science Foundation of China

DOI:

10.1109/ICCV51070.2023.02139

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Medical vision-and-language pre-training (Med-VLP) has shown promising improvements on many downstream medical tasks owing to its ability to extract generic representations from medical images and texts. In practice, there are two typical types, the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used. The former is superior at multi-modal tasks owing to the sufficient interaction between modalities; the latter is good at uni-modal and cross-modal tasks due to its single-modality encoding ability. To take advantage of both, we propose an effective yet straightforward scheme named PTUnifier to unify the two types. We first unify the input format by introducing visual and textual prompts, which serve as DETR-like queries that assist in extracting features when one of the modalities is missing. By doing so, a single model could serve as a foundation model that processes various tasks adopting different input formats (i.e., image-only, text-only, and image-text-pair). Furthermore, we construct a prompt pool (instead of static prompts) to improve diversity and scalability, enabling queries conditioned on different input instances. Experimental results show that our approach achieves competitive results on a broad range of tasks, spanning uni-modal tasks (i.e., image/text classification and text summarization), cross-modal tasks (i.e., image-to-text generation and image-text/text-image retrieval), and multi-modal tasks (i.e., visual question answering), demonstrating the effectiveness of our approach. Note that the adoption of prompts is orthogonal to most existing Med-VLP approaches and could be a beneficial and complementary extension to these approaches.
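The input unification described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean-pooled query, the cosine-similarity ranking, and the top-k selection rule are all assumptions made for illustration, and plain Python lists stand in for learned embeddings. The idea shown is only that, when one modality is missing, prompts chosen from a pool conditioned on the present modality fill its slot, so the model always receives an image-side and a text-side sequence.

```python
import math

def mean_pool(tokens):
    """Average equal-length token vectors into a single query vector."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_prompts(query, pool, k):
    """Hypothetical selection rule: the k pool prompts most similar to the query."""
    ranked = sorted(pool, key=lambda p: cosine(query, p), reverse=True)
    return ranked[:k]

def unify_inputs(image_tokens, text_tokens, visual_pool, textual_pool, k=2):
    """Return (image-side, text-side) sequences, filling a missing side with prompts.

    Assumption: the query for the pool is the mean-pooled embedding of the
    modality that is present.
    """
    if image_tokens is None and text_tokens is None:
        raise ValueError("at least one modality is required")
    if image_tokens is None:
        # Text-only input: visual prompts stand in for the missing image.
        image_tokens = select_prompts(mean_pool(text_tokens), visual_pool, k)
    if text_tokens is None:
        # Image-only input: textual prompts stand in for the missing text.
        text_tokens = select_prompts(mean_pool(image_tokens), textual_pool, k)
    return image_tokens, text_tokens

# Toy 2-dimensional pools; in a real model these would be learned parameters.
visual_pool = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
textual_pool = [[-1.0, 0.0], [0.0, -1.0], [0.5, 0.5]]

text = [[0.9, 0.1], [1.0, 0.0]]
img, txt = unify_inputs(None, text, visual_pool, textual_pool, k=2)
```

Because the selected prompts depend on the input instance, two different text-only inputs can receive different visual prompts, which is the contrast with a single static prompt that the abstract draws.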

Pages: 23346-23356

Number of pages: 11

50 records in total

[1] DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training
Huang, Luyang
Niu, Guocheng
Liu, Jiachen
Xiao, Xinyan
Wu, Hua
[J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 2552 - 2566
[2] Simultaneously Training and Compressing Vision-and-Language Pre-Training Model
Qi, Qiaosong
Zhang, Aixi
Liao, Yue
Sun, Wenyu
Wang, Yongliang
Li, Xiaobo
Liu, Si
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8194 - 8203
[3] Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge
Chen, Zhihong
Li, Guanbin
Wan, Xiang
[J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5152 - 5161
[4] Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training
Chen, Zhihong
Du, Yuhao
Hu, Jinpeng
Liu, Yang
Li, Guanbin
Wan, Xiang
Chang, Tsung-Hui
[J]. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT V, 2022, 13435 : 679 - 689
[5] Weakly Supervised Vision-and-Language Pre-training with Relative Representations
Chen, Chi
Li, Peng
Sun, Maosong
Liu, Yang
[J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 8341 - 8355
[6] Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
Li, Liunian Harold
You, Haoxuan
Wang, Zhecan
Zareian, Alireza
Chang, Shih-Fu
Chang, Kai-Wei
[J]. 2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 5339 - 5350
[7] Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation
Wu, Siying
Fu, Xueyang
Wu, Feng
Zha, Zheng-Jun
[J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4233 - 4241
[8] Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation
Cui, Yibo
Xie, Liang
Zhang, Yakun
Zhang, Meishan
Yan, Ye
Yin, Erwei
[J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 12009 - 12019
[9] HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation
Qiao, Yanyuan
Qi, Yuankai
Hong, Yicong
Yu, Zheng
Wang, Peng
Wu, Qi
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15397 - 15406
[10] Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
Zhou, Mingyang
Yu, Licheng
Singh, Amanpreet
Wang, Mengjiao
Yu, Zhou
Zhang, Ning
[J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 16464 - 16473