Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation

Cited by: 0
Authors
Xu, Jie [1]
Lu, Lu [1]
Peng, Xinwei [1]
Pang, Jiali [1]
Ding, Jinru [1]
Yang, Lingrui [2]
Song, Huan [3,4]
Li, Kang [3,4]
Sun, Xin [2]
Zhang, Shaoting [1]
Affiliations
[1] Shanghai Artificial Intelligence Lab, OpenMedLab, Shanghai, Peoples R China
[2] Shanghai Jiao Tong Univ, Xinhua Hosp, Clin Res & Innovat Unit, Sch Med, Shanghai, Peoples R China
[3] Sichuan Univ, West China Hosp, West China Biomed Big Data Ctr, Chengdu, Peoples R China
[4] Sichuan Univ, Medx Ctr Informat, Chengdu, Peoples R China
Keywords
ChatGPT; LLM; assessment; data set; benchmark; medicine; Delphi method
DOI
10.2196/57674
CLC number
R-058
Abstract
Background: Large language models (LLMs) have achieved great progress in natural language processing tasks and demonstrated the potential for use in clinical applications. Despite their capabilities, LLMs in the medical domain are prone to generating hallucinations (not fully reliable responses). Hallucinations in LLMs' responses create substantial risks, potentially threatening patients' physical safety. Thus, to perceive and prevent this safety risk, it is essential to evaluate LLMs in the medical domain and build a systematic evaluation.

Objective: We developed a comprehensive evaluation system, MedGPTEval, composed of criteria, medical data sets in Chinese, and publicly available benchmarks.

Methods: First, a set of evaluation criteria was designed based on a comprehensive literature review. Second, the candidate criteria were optimized by using a Delphi method with 5 experts in medicine and engineering. Third, 3 clinical experts designed medical data sets to interact with LLMs. Finally, benchmarking experiments were conducted on the data sets, and the responses generated by chatbots based on LLMs were recorded for blind evaluations by 5 licensed medical experts. The resulting evaluation criteria cover medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. The medical data sets include 27 medical dialogues and 7 case reports in Chinese. Three chatbots were evaluated: ChatGPT by OpenAI; ERNIE Bot by Baidu, Inc; and Doctor PuJiang (Dr PJ) by Shanghai Artificial Intelligence Laboratory.

Results: Dr PJ outperformed ChatGPT and ERNIE Bot in both the multiple-turn medical dialogue and case report scenarios. Dr PJ also outperformed ChatGPT in the semantic consistency rate and the complete error rate, indicating better robustness. However, Dr PJ had slightly lower scores in medical professional capabilities than ChatGPT in the multiple-turn dialogue scenario.

Conclusions: MedGPTEval provides comprehensive criteria to evaluate chatbots based on LLMs in the medical domain, open-source data sets, and benchmarks assessing 3 LLMs. Experimental results demonstrate that Dr PJ outperforms ChatGPT and ERNIE Bot in social and professional contexts. Therefore, this assessment system can be easily adopted by researchers in the community to augment the open-source data set.
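For illustration only, the two robustness indicators named in the Results (semantic consistency rate and complete error rate) can be aggregated from expert judgements roughly as in the Python sketch below. The data structure, field names, and sample values are assumptions made for this sketch and are not taken from the MedGPTEval paper.

    # Minimal sketch, assuming each expert judgement records whether a chatbot's
    # answer to a perturbed question stays semantically consistent with the answer
    # to the original question, and whether the answer is entirely incorrect.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class RobustnessJudgement:
        semantically_consistent: bool  # meaning preserved under question perturbation
        completely_wrong: bool         # answer judged entirely incorrect by the expert

    def semantic_consistency_rate(judgements: List[RobustnessJudgement]) -> float:
        """Fraction of answers judged semantically consistent."""
        return sum(j.semantically_consistent for j in judgements) / len(judgements)

    def complete_error_rate(judgements: List[RobustnessJudgement]) -> float:
        """Fraction of answers judged entirely incorrect."""
        return sum(j.completely_wrong for j in judgements) / len(judgements)

    if __name__ == "__main__":
        # Hypothetical judgements from one blind evaluator.
        sample = [
            RobustnessJudgement(True, False),
            RobustnessJudgement(True, False),
            RobustnessJudgement(False, True),
            RobustnessJudgement(True, False),
        ]
        print(f"semantic consistency rate: {semantic_consistency_rate(sample):.2f}")
        print(f"complete error rate:       {complete_error_rate(sample):.2f}")

In the benchmark, rates of this kind would be averaged over questions and over the 5 licensed medical experts performing the blind evaluation; the averaging scheme here is likewise an assumption.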
Pages: 10