PMC-LLaMA: toward building open-source language models for medicine

Cited by: 13
Authors
Wu, Chaoyi [1 ,2 ]
Lin, Weixiong [1 ,2 ]
Zhang, Xiaoman [1 ,2 ]
Zhang, Ya [1 ,2 ]
Xie, Weidi [1 ,2 ]
Wang, Yanfeng [1 ,2 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Cooperat Medianet Innovat Ctr CM, Shanghai 200240, Peoples R China
[2] Shanghai AI Lab, Shanghai 200232, Peoples R China
Funding
National Key Research and Development Program of China
Keywords
large language models; biomedical NLP; generative language models; ChatGPT;
DOI
10.1093/jamia/ocae045
Chinese Library Classification
TP [Automation and Computer Technology]
Subject Classification Code
0812
Abstract
Objective
Large language models (LLMs) have recently showcased remarkable capabilities in natural language understanding. While proficient in everyday conversation and question-answering (QA) settings, these models often falter in domains that demand precision, such as medical applications, because they lack domain-specific knowledge. In this article, we describe the procedure for building a powerful, open-source language model designed specifically for medicine, termed PMC-LLaMA.

Materials and Methods
We adapt a general-purpose LLM to the medical domain through data-centric knowledge injection, integrating 4.8M biomedical academic papers and 30K medical textbooks, followed by comprehensive domain-specific instruction fine-tuning on 202M tokens covering medical QA, reasoning rationales, and conversational dialogues.

Results
Across various public medical QA benchmarks and in manual ratings, our lightweight PMC-LLaMA, with only 13B parameters, exhibits superior performance, even surpassing ChatGPT. All models, code, and datasets for instruction tuning will be released to the research community.

Discussion
Our contributions are 3-fold: (1) we build an open-source LLM for the medical domain; we believe PMC-LLaMA can promote further development of foundation models in medicine, serving as a trainable generative language backbone; (2) we conduct thorough ablation studies that demonstrate the effectiveness of each proposed component and show how different training data and model scales affect medical LLMs; (3) we contribute a large-scale, comprehensive dataset for instruction tuning.

Conclusion
In this article, we systematically investigate the process of building an open-source medical LLM, PMC-LLaMA.
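The two-stage recipe the abstract describes, continued pretraining for knowledge injection followed by instruction fine-tuning, can be illustrated with a minimal sketch. The code below assumes a Hugging Face Transformers causal-LM workflow; the base checkpoint name, data file paths, prompt template, and hyperparameters are illustrative placeholders, not the authors' released artifacts.

# Minimal sketch (not the authors' code) of the two-stage adaptation:
# (1) continued pretraining on biomedical text for knowledge injection,
# then (2) supervised instruction fine-tuning. Names are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "huggyllama/llama-13b"  # assumed general-purpose backbone

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship no pad token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

def tokenize(batch):
    # Plain causal-LM objective over raw text.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

def run_stage(dataset, output_dir):
    # One training pass; the collator turns inputs into shifted LM labels.
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=1,
            per_device_train_batch_size=1,
            gradient_accumulation_steps=32,
            learning_rate=2e-5,
        ),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

# Stage 1: knowledge injection on papers and textbooks (placeholder path).
corpus = load_dataset("json", data_files="biomedical_corpus.jsonl", split="train")
corpus = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)
run_stage(corpus, "stage1_knowledge_injection")

# Stage 2: instruction tuning on QA, rationale, and dialogue records, each
# rendered into a single prompt-plus-response string before tokenization.
def render(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['response']}"
    }

sft = load_dataset("json", data_files="medical_instructions.jsonl", split="train")
sft = sft.map(render).map(tokenize, batched=True, remove_columns=sft.column_names)
run_stage(sft, "stage2_instruction_tuning")

Under these assumptions the two stages differ mainly in data, not mechanics, which is why a single causal-LM training loop covers both.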
Pages: 1833-1843
Number of pages: 11
Related Papers
50 records in total
  • [1] PharmaLLM: A Medicine Prescriber Chatbot Exploiting Open-Source Large Language Models
    Azam, Ayesha
    Naz, Zubaira
    Khan, Muhammad Usman Ghani
    [J]. HUMAN-CENTRIC INTELLIGENT SYSTEMS, 2024, 4 (4): 527-544
  • [2] Servicing open-source large language models for oncology
    Ray, Partha Pratim
    [J]. ONCOLOGIST, 2024,
  • [3] Building open-source AI
    Shrestha, Yash Raj
    von Krogh, Georg
    Feuerriegel, Stefan
    [J]. NATURE COMPUTATIONAL SCIENCE, 2023, 3 (11): 908-911
  • [4] Toward Open-source Epidemiology
    Goldstein, Neal D.
    [J]. EPIDEMIOLOGY, 2018, 29 (2): 161-164
  • [5] A tutorial on open-source large language models for behavioral science
    Hussain, Zak
    Binz, Marcel
    Mata, Rui
    Wulff, Dirk U.
    [J]. BEHAVIOR RESEARCH METHODS, 2024: 8214-8237
  • [6] Preliminary Systematic Review of Open-Source Large Language Models in Education
    Lin, Michael Pin-Chuan
    Chang, Daniel
    Hall, Sarah
    Jhajj, Gaganpreet
    [J]. GENERATIVE INTELLIGENCE AND INTELLIGENT TUTORING SYSTEMS, PT I, ITS 2024, 2024, 14798: 68-77
  • [7] TeenyTinyLlama: Open-source tiny language models trained in Brazilian Portuguese
    Correa, Nicholas Kluge
    Falk, Sophia
    Fatimah, Shiza
    Sen, Aniket
    De Oliveira, Nythamar
    [J]. MACHINE LEARNING WITH APPLICATIONS, 2024, 16
  • [8] Open-source language AI challenges big tech's models
    Gibney, Elizabeth
    [J]. NATURE, 2022, 606 (7916): 850-851