Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment

Times cited: 0
Authors
Chen, Zhihao [1]
Zhou, Yang [2]
Tran, Anh [3]
Zhao, Junting [1]
Wan, Liang [1]
Ooi, Gideon Su Kai [4]
Cheng, Lionel Tim-Ee [3]
Thng, Choon Hua [4]
Xu, Xinxing [2]
Liu, Yong [2]
Fu, Huazhu [2]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] ASTAR, Inst High Performance Comp IHPC, 1 Fusionopolis Way 16-16 Connexis, Singapore 138632, Singapore
[3] Singapore Gen Hosp, Singapore, Singapore
[4] Natl Canc Ctr Singapore, Singapore, Singapore
Funding
National Research Foundation, Singapore
Keywords
Medical phrase grounding; vision-language model; contrastive learning
DOI
10.1007/978-3-031-43990-2_35
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Medical phrase grounding (MPG) aims to locate the most relevant region in a medical image, given a phrase query describing certain medical findings, which is an important task for medical image analysis and radiological diagnosis. However, existing visual grounding methods rely on general visual features for identifying objects in natural images and are not capable of capturing the subtle and specialized features of medical findings, leading to a sub-optimal performance in MPG. In this paper, we propose MedRPG, an end-to-end approach for MPG. MedRPG is built on a lightweight vision-language transformer encoder and directly predicts the box coordinates of mentioned medical findings, which can be trained with limited medical data, making it a valuable tool in medical image analysis. To enable MedRPG to locate nuanced medical findings with better region-phrase correspondences, we further propose Tri-attention Context contrastive alignment (TaCo). TaCo seeks context alignment to pull both the features and attention outputs of relevant region-phrase pairs close together while pushing those of irrelevant regions far away. This ensures that the final box prediction depends more on its finding-specific regions and phrases. Experimental results on three MPG datasets demonstrate that our MedRPG outperforms state-of-the-art visual grounding approaches by a large margin. Additionally, the proposed TaCo strategy is effective in enhancing finding localization ability and reducing spurious region-phrase correlations.
Pages: 371-381
Number of pages: 11