Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts

Cited by: 0
Authors
Chen, Zhihong [1 ,2 ]
Diao, Shizhe [3 ]
Wang, Benyou [1 ,2 ]
Li, Guanbin [4 ]
Wan, Xiang [2 ]
Affiliations
[1] Chinese Univ Hong Kong, Shenzhen, Peoples R China
[2] Shenzhen Res Inst Big Data, Shenzhen, Peoples R China
[3] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[4] Sun Yat Sen Univ, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
10.1109/ICCV51070.2023.02139
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Medical vision-and-language pre-training (Med-VLP) has shown promising improvements on many downstream medical tasks owing to its ability to extract generic representations from medical images and texts. In practice, there are two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used. The former is superior at multi-modal tasks owing to the sufficient interaction between modalities; the latter is good at uni-modal and cross-modal tasks due to its single-modality encoding ability. To take advantage of these two types, we propose an effective yet straightforward scheme named PTUnifier to unify the two types. We first unify the input format by introducing visual and textual prompts, which serve as DETR-like queries that assist in extracting features when one of the modalities is missing. By doing so, a single model can serve as a foundation model that processes various tasks adopting different input formats (i.e., image-only, text-only, and image-text-pair). Furthermore, we construct a prompt pool (instead of static prompts) to improve diversity and scalability, enabling queries conditioned on different input instances. Experimental results show that our approach achieves competitive results on a broad range of tasks, spanning uni-modal tasks (i.e., image/text classification and text summarization), cross-modal tasks (i.e., image-to-text generation and image-text/text-image retrieval), and multi-modal tasks (i.e., visual question answering), demonstrating the effectiveness of our approach. Note that the adoption of prompts is orthogonal to most existing Med-VLP approaches and could be a beneficial and complementary extension to these approaches.
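To make the prompt-based unification idea concrete, the sketch below illustrates the mechanism described in the abstract: a single fusion encoder consumes concatenated visual and textual token sequences, and when one modality is absent, learnable soft prompts drawn from a pool (selected per instance, conditioned on the modality that is present) stand in for the missing side. This is a minimal, illustrative PyTorch sketch under assumed dimensions and selection details, not the authors' implementation; all class and parameter names here are hypothetical.

```python
# Illustrative sketch of modality-missing soft prompts with a prompt pool.
# Names, sizes, and the top-k selection rule are assumptions, not PTUnifier's code.
import torch
import torch.nn as nn


class PromptPool(nn.Module):
    """A pool of learnable prompt vectors; a subset is selected per instance."""

    def __init__(self, pool_size: int = 64, num_prompts: int = 16, dim: int = 768):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, dim))
        self.prompts = nn.Parameter(torch.randn(pool_size, dim))
        self.num_prompts = num_prompts

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (B, dim) pooled feature of the modality that IS present.
        scores = query @ self.keys.t()                        # (B, pool_size)
        top = scores.topk(self.num_prompts, dim=-1).indices   # (B, num_prompts)
        return self.prompts[top]                              # (B, num_prompts, dim)


class UnifiedEncoder(nn.Module):
    """Single fusion encoder fed with [visual tokens; textual tokens]."""

    def __init__(self, dim: int = 768, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, depth)
        self.visual_pool = PromptPool(dim=dim)
        self.textual_pool = PromptPool(dim=dim)

    def forward(self, img_tokens=None, txt_tokens=None):
        # If a modality is absent, replace it with prompts conditioned on
        # the pooled representation of the modality that is present.
        assert img_tokens is not None or txt_tokens is not None
        if img_tokens is None:
            img_tokens = self.visual_pool(txt_tokens.mean(dim=1))
        if txt_tokens is None:
            txt_tokens = self.textual_pool(img_tokens.mean(dim=1))
        return self.fusion(torch.cat([img_tokens, txt_tokens], dim=1))


# Usage: the same model handles all three input formats.
model = UnifiedEncoder()
img = torch.randn(2, 49, 768)   # e.g. patch features from an image encoder
txt = torch.randn(2, 32, 768)   # e.g. token features from a text encoder
_ = model(img, txt)             # image-text pair
_ = model(img, None)            # image-only: textual prompts fill in
_ = model(None, txt)            # text-only: visual prompts fill in
```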
Pages: 23346-23356
Page count: 11
Related Papers
50 records in total
  • [1] DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training
    Huang, Luyang
    Niu, Guocheng
    Liu, Jiachen
    Xiao, Xinyan
    Wu, Hua
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 2552 - 2566
  • [2] Simultaneously Training and Compressing Vision-and-Language Pre-Training Model
    Qi, Qiaosong
    Zhang, Aixi
    Liao, Yue
    Sun, Wenyu
    Wang, Yongliang
    Li, Xiaobo
    Liu, Si
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8194 - 8203
  • [3] Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge
    Chen, Zhihong
    Li, Guanbin
    Wan, Xiang
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5152 - 5161
  • [4] Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training
    Chen, Zhihong
    Du, Yuhao
    Hu, Jinpeng
    Liu, Yang
    Li, Guanbin
    Wan, Xiang
    Chang, Tsung-Hui
    [J]. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2022, PT V, 2022, 13435 : 679 - 689
  • [5] Weakly Supervised Vision-and-Language Pre-training with Relative Representations
    Chen, Chi
    Li, Peng
    Sun, Maosong
    Liu, Yang
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 8341 - 8355
  • [6] Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
    Li, Liunian Harold
    You, Haoxuan
    Wang, Zhecan
    Zareian, Alireza
    Chang, Shih-Fu
    Chang, Kai-Wei
    [J]. 2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 5339 - 5350
  • [7] Cross-modal Semantic Alignment Pre-training for Vision-and-Language Navigation
    Wu, Siying
    Fu, Xueyang
    Wu, Feng
    Zha, Zheng-Jun
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4233 - 4241
  • [8] Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation
    Cui, Yibo
    Xie, Liang
    Zhang, Yakun
    Zhang, Meishan
    Yan, Ye
    Yin, Erwei
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 12009 - 12019
  • [9] HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation
    Qiao, Yanyuan
    Qi, Yuankai
    Hong, Yicong
    Yu, Zheng
    Wang, Peng
    Wu, Qi
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15397 - 15406
  • [10] Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment
    Zhou, Mingyang
    Yu, Licheng
    Singh, Amanpreet
    Wang, Mengjiao
    Yu, Zhou
    Zhang, Ning
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 16464 - 16473