Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation

Cited: 0
Authors
Xu, Jie [1]
Lu, Lu [1]
Peng, Xinwei [1]
Pang, Jiali [1]
Ding, Jinru [1]
Yang, Lingrui [2]
Song, Huan [3,4]
Li, Kang [3,4]
Sun, Xin [2]
Zhang, Shaoting [1]
Affiliations
[1] Shanghai Artificial Intelligence Lab, OpenMedLab, Shanghai, Peoples R China
[2] Shanghai Jiao Tong Univ, Xinhua Hosp, Clin Res & Innovat Unit, Sch Med, Shanghai, Peoples R China
[3] Sichuan Univ, West China Hosp, West China Biomed Big Data Ctr, Chengdu, Peoples R China
[4] Sichuan Univ, Medx Ctr Informat, Chengdu, Peoples R China
Keywords
ChatGPT; LLM; assessment; data set; benchmark; medicine; Delphi method
DOI
10.2196/57674
CLC Number: R-058
Abstract
Background: Large language models (LLMs) have made great progress on natural language processing tasks and have demonstrated potential for clinical applications. Despite these capabilities, LLMs in the medical domain are prone to generating hallucinations (responses that are not fully reliable). Hallucinations in LLMs' responses create substantial risks, potentially threatening patients' physical safety. Thus, to detect and prevent this safety risk, it is essential to evaluate LLMs in the medical domain and to build a systematic evaluation framework. Objective: We developed a comprehensive evaluation system, MedGPTEval, composed of criteria, medical data sets in Chinese, and publicly available benchmarks. Methods: First, a set of evaluation criteria was designed based on a comprehensive literature review. Second, the candidate criteria were optimized by using a Delphi method with 5 experts in medicine and engineering. Third, 3 clinical experts designed medical data sets for interacting with LLMs. Finally, benchmarking experiments were conducted on the data sets. The responses generated by chatbots based on LLMs were recorded for blind evaluation by 5 licensed medical experts. The resulting evaluation criteria cover medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. The medical data sets include 27 medical dialogues and 7 case reports in Chinese. Three chatbots were evaluated: ChatGPT by OpenAI; ERNIE Bot by Baidu, Inc; and Doctor PuJiang (Dr PJ) by Shanghai Artificial Intelligence Laboratory. Results: Dr PJ outperformed ChatGPT and ERNIE Bot in both the multiple-turn medical dialogue and case report scenarios. Dr PJ also outperformed ChatGPT on the semantic consistency rate and the complete error rate, indicating better robustness. However, Dr PJ scored slightly lower than ChatGPT on medical professional capabilities in the multiple-turn dialogue scenario. Conclusions: MedGPTEval provides comprehensive criteria for evaluating LLM-based chatbots in the medical domain, open-source data sets, and benchmarks assessing 3 LLMs. Experimental results demonstrate that Dr PJ outperforms ChatGPT and ERNIE Bot in social and professional contexts. Such an assessment system can be readily adopted by researchers in this community to augment the open-source data set.
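The abstract names two robustness metrics, the semantic consistency rate and the complete error rate, without defining them. The Python sketch below shows one plausible way such rates could be computed from blinded expert ratings; the RatedResponse schema, its field names, and the simple-proportion definitions are illustrative assumptions, not the paper's actual implementation.

from dataclasses import dataclass

@dataclass
class RatedResponse:
    # Hypothetical schema for one chatbot response with blinded expert judgments.
    semantically_consistent: bool  # meaning preserved across paraphrased prompts
    completely_wrong: bool         # every factual claim in the response is incorrect

def robustness_metrics(ratings: list[RatedResponse]) -> dict[str, float]:
    # Assumption: both metrics are simple proportions over all rated responses;
    # the paper may define them differently.
    n = len(ratings)
    if n == 0:
        raise ValueError("no ratings provided")
    return {
        "semantic_consistency_rate": sum(r.semantically_consistent for r in ratings) / n,
        "complete_error_rate": sum(r.completely_wrong for r in ratings) / n,
    }

# Example: 27 dialogue responses, of which 24 are consistent and 2 are completely wrong.
demo = ([RatedResponse(True, False)] * 24
        + [RatedResponse(False, False)] * 1
        + [RatedResponse(False, True)] * 2)
print(robustness_metrics(demo))
# {'semantic_consistency_rate': 0.888..., 'complete_error_rate': 0.074...}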
Pages: 10
Related Papers (50 records total)
  • [21] Automating Patch Set Generation from Code Review Comments Using Large Language Models
    Rahman, Tajmilur
    Singh, Rahul
    Sultan, Mir Yousuf
    PROCEEDINGS 2024 IEEE/ACM 3RD INTERNATIONAL CONFERENCE ON AI ENGINEERING-SOFTWARE ENGINEERING FOR AI, CAIN 2024, 2024: 273-274
  • [22] Can Large Language Models Predict Data Correlations from Column Names?
    Trummer, Immanuel
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 16 (13): 4310-4323
  • [24] From text to treatment: the crucial role of validation for generative large language models in health care
    de Hond, Anne
    Leeuwenberg, Tuur
    Bartels, Richard
    van Buchem, Marieke
    Kant, Ilse
    Moons, Karel G. M.
    van Smeden, Maarten
    LANCET DIGITAL HEALTH, 2024, 6 (07): e441-e443
  • [25] The Meaning Extraction Method: An Approach to Evaluate Content Patterns From Large-Scale Language Data
    Markowitz, David M.
    FRONTIERS IN COMMUNICATION, 2021, 6
  • [26] Development and Validation of a Set of Palliative Medicine Entrustable Professional Activities: Findings from a Mixed Methods Study
    Myers, Jeff
    Krueger, Paul
    Webster, Fiona
    Downar, James
    Herx, Leonie
    Jeney, Christa
    Oneschuk, Doreen
    Schroder, Cori
    Sirianni, Giovanna
    Seccareccia, Dori
    Tucker, Tara
    Taniguchi, Alan
    JOURNAL OF PALLIATIVE MEDICINE, 2015, 18 (08): 682-690
  • [27] A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Expert-Edited Large Language Models Alone
    Tailor, Prashant D.
    Dalvin, Lauren A.
    Chen, John J.
    Iezzi, Raymond
    Olsen, Timothy W.
    Scruggs, Brittni A.
    Barkmeier, Andrew J.
    Bakri, Sophie J.
    Ryan, Edwin H.
    Tang, Peter H.
    Parke, D. Wilkin, III
    Belin, Peter J.
    Sridhar, Jayanth
    Xu, David
    Kuriyan, Ajay E.
    Yonekawa, Yoshihiro
    Starr, Matthew R.
    OPHTHALMOLOGY SCIENCE, 2024, 4 (04)
  • [28] Large language models could change the future of behavioral healthcare: a proposal for responsible development and evaluation
    Elizabeth C. Stade
    Shannon Wiltsey Stirman
    Lyle H. Ungar
    Cody L. Boland
    H. Andrew Schwartz
    David B. Yaden
    João Sedoc
    Robert J. DeRubeis
    Robb Willer
    Johannes C. Eichstaedt
    npj Mental Health Research, 3 (1)
  • [29] An active inference strategy for prompting reliable responses from large language models in medical practice
    Roma Shusterman
    Allison C. Waters
    Shannon O’Neill
    Marshall Bangs
    Phan Luu
    Don M. Tucker
    npj Digital Medicine, 8 (1)
  • [30] Accurate Quantum Chemical Calculation of Ionization Potentials: Validation of the DFT-LOC Approach via a Large Data Set Obtained from Experiments and Benchmark Quantum Chemical Calculations
    Li, Guangqi
    Rudshteyn, Benjamin
    Shee, James
    Weber, John L.
    Coskun, Dilek
    Bochevarov, Art D.
    Friesner, Richard A.
    JOURNAL OF CHEMICAL THEORY AND COMPUTATION, 2020, 16 (04): 2109-2123