PathologyVLM: a large vision-language model for pathology image understanding

Authors
Dawei Dai [1]
Yuanhui Zhang [1]
Qianlan Yang [2]
Long Xu [1]
Xiaojing Shen [2]
Shuyin Xia [1]
Guoyin Wang [1]
Affiliations
[1] Chongqing University of Posts and Telecommunications, Chongqing Key Laboratory of Computational Intelligence
[2] School of Medicine Tongji University, Shanghai First Maternity and Infant Hospital
[3] Chongqing Normal University, College of Computer and Information Science
Keywords
Pathology image understanding; VQA; VLM; Multi-modal
DOI
10.1007/s10462-025-11190-1
Abstract
Earlier advances in pathology image understanding mainly came from models tailored to specific tasks. Recent studies have shown that large vision-language models can improve performance on a variety of downstream tasks in medical image understanding. In this study, we develop a domain-specific large vision-language model, PathologyVLM, for pathology image understanding. Specifically: (1) we construct a human pathology image-text dataset for domain-specific alignment by cleaning public medical image-text data; (2) using this image-text data, we train a pathology language-image pretraining (PLIP) model as a specialized visual encoder to extract features from pathology images, and we develop a scale-invariant connector to avoid the information loss caused by image scaling; (3) we adopt two-stage learning to train PathologyVLM, with the first stage for domain alignment and the second stage for the end-to-end visual question answering (VQA) task. In experiments, we evaluate PathologyVLM on both supervised and zero-shot VQA datasets, where it achieves the best overall performance among multimodal models of similar scale. Ablation experiments also confirm the effectiveness of our design. We believe that PathologyVLM and the datasets presented in this work can promote research in the field of computational pathology. All code is available at: https://github.com/ddw2AIGROUP2CQUPT/PA-LLaVA