PathologyVLM: a large vision-language model for pathology image understanding

Cited by: 0

Authors:

Dawei Dai [1]

Yuanhui Zhang [1]

Qianlan Yang [2]

Long Xu [1]

Xiaojing Shen [2]

Shuyin Xia [1]

Guoyin Wang [1]

Affiliations:

[1] Chongqing University of Posts and Telecommunications, Chongqing Key Laboratory of Computational Intelligence

[2] School of Medicine, Tongji University, Shanghai First Maternity and Infant Hospital

[3] Chongqing Normal University, College of Computer and Information Science

Source:

Artificial Intelligence Review, Volume 58, Issue 6

Keywords:

Pathology image understanding; VQA; VLM; Multi-modal

DOI:

10.1007/s10462-025-11190-1

Abstract:

Earlier advances in pathology image understanding mainly involved developing models tailored to specific tasks. Recent studies have shown that large vision-language models can improve performance on various downstream tasks in medical image understanding. In this study, we develop a domain-specific large vision-language model, PathologyVLM, for pathology image understanding. Specifically: (1) we construct a human pathology image-text dataset for domain-specific alignment by cleaning public medical image-text data; (2) using this image-text data, we train a pathology language-image pretraining (PLIP) model as a specialized visual encoder to extract features from pathology images, and we develop a scale-invariant connector to avoid the information loss caused by image scaling; (3) we adopt two-stage learning to train PathologyVLM, with the first stage for domain alignment and the second stage for the end-to-end visual question answering (VQA) task. In experiments, we evaluate PathologyVLM on both supervised and zero-shot VQA datasets, where it achieves the best overall performance among multimodal models of similar scale. Ablation experiments also confirm the effectiveness of our design. We expect that the PathologyVLM model and the datasets presented in this work can promote research in the field of computational pathology. All code is available at: https://github.com/ddw2AIGROUP2CQUPT/PA-LLaVA
