Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis

Cited by: 28
Authors
Song, Haifeng [1 ,2 ]
Xia, Yi [3 ,4 ]
Luo, Zhichao [1 ,2 ]
Liu, Hui [1 ,2 ]
Song, Yan [5 ]
Zeng, Xue [1 ,2 ]
Li, Tianjie [1 ,2 ]
Zhong, Guangxin [1 ,2 ]
Li, Jianxing [1 ,2 ]
Chen, Ming [3 ]
Zhang, Guangyuan [3 ]
Xiao, Bo [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Beijing Tsinghua Changgung Hosp, Sch Clin Med, Dept Urol, 168 Litang Rd, Beijing 102218, Peoples R China
[2] Tsinghua Univ, Inst Urol, Sch Clin Med, Beijing 102218, Peoples R China
[3] Southeast Univ, Zhongda Hosp, Dept Urol, 87 Dingjiaqiao, Nanjing 210009, Peoples R China
[4] Southeast Univ, Sch Med, Nanjing 210009, Peoples R China
[5] China Med Univ, Urol Dept, Sheng Jing Hosp, Shenyang 110000, Peoples R China
Keywords
Urolithiasis; Health consultation; Large language model; ChatGPT; Artificial intelligence;
DOI
10.1007/s10916-023-02021-3
Chinese Library Classification
R19 [Health organization and services (health services administration)];
Subject Classification Code
Abstract
Objectives: To evaluate the effectiveness of four large language models (LLMs) with large user bases and significant public attention (Claude, Bard, ChatGPT4, and New Bing) in the context of medical consultation and patient education in urolithiasis.
Materials and methods: We developed a questionnaire consisting of 21 questions and 2 clinical scenarios related to urolithiasis. Clinical consultations were then simulated with each of the four models to assess their responses to the questions. Urolithiasis experts evaluated the responses for accuracy, comprehensiveness, ease of understanding, human care, and clinical case analysis ability on a predesigned 5-point Likert scale. Visualization and statistical analyses were then employed to compare the performance of the four models.
Results: All models performed satisfactorily, except that Bard failed to provide a valid response to Question 13. Claude consistently scored highest in all dimensions. ChatGPT4 ranked second in accuracy, with relatively stable output across multiple tests, but showed shortcomings in empathy and human care. Bard exhibited the lowest accuracy and overall performance. Both Claude and ChatGPT4 showed a high capacity to analyze clinical cases of urolithiasis. Overall, Claude emerged as the best performer in urolithiasis consultation and education.
Conclusion: Claude demonstrated superior performance compared with the other three models in urolithiasis consultation and education. This study highlights the remarkable potential of LLMs in medical health consultation and patient education, although professional review, further evaluation, and modification are still required.
Pages: 9
Related Papers (50 records in total)
  • [41] Large language models in bariatric surgery patient support: A transformative approach to patient education and engagement
    Samaan, Jamil S.
    Rajeev, Nithya
    Srinivasan, Nitin
    Yeo, Yee Hui
    Samakar, Kamran
    CLINICAL OBESITY, 2024, 14 (02)
  • [42] Evaluating the Accuracy, Reliability, Consistency, and Readability of Different Large Language Models in Restorative Dentistry
    Ozdemir, Zeyneb Merve
    Yapici, Emre
    JOURNAL OF ESTHETIC AND RESTORATIVE DENTISTRY, 2025,
  • [43] Evaluating Large Language Learning Models' Accuracy and Reliability in Addressing Consumer Health Queries
    Chung, Sunny
    Koos, Jessica
    JOURNAL OF CONSUMER HEALTH ON THE INTERNET, 2024, 28 (04) : 395 - 402
  • [44] Assessing the Current Limitations of Large Language Models in Advancing Health Care Education
    Kim, Jaeyong
    Vajravelu, Bathri Narayan
    JMIR FORMATIVE RESEARCH, 2025, 9
  • [45] Benchmarking State-of-the-Art Large Language Models for Migraine Patient Education: Performance Comparison of Responses to Common Queries
    Li, Linger
    Li, Pengfei
    Wang, Kun
    Zhang, Liang
    Ji, Hongwei
    Zhao, Hongqin
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [46] Evaluating the Performance of Large Language Models in Predicting Diagnostics for Spanish Clinical Cases in Cardiology
    Delaunay, Julien
    Cusido, Jordi
    APPLIED SCIENCES-BASEL, 2025, 15 (01)
  • [47] Evaluating the Performance of Large Language Models in Hematopoietic Stem Cell Transplantation Decision Making
    Civettini, Ivan
    Zappaterra, Arianna
    Ramazzotti, Daniele
    Granelli, Bianca Maria
    Rindone, Giovanni
    Aroldi, Andrea
    Bonfanti, Stefano
    Colombo, Federica
    Fedele, Marilena
    Grillo, Giovanni
    Parma, Matteo
    Perfetti, Paola
    Terruzzi, Elisabetta
    Gambacorti-Passerini, Carlo
    Cavalca, Fabrizio
    BLOOD, 2023, 142
  • [48] Readability rescue: large language models may improve readability of patient education materials
    Breneman, Alyssa
    Trager, Megan H.
    Gordon, Emily R.
    Samie, Faramarz H.
    ARCHIVES OF DERMATOLOGICAL RESEARCH, 2024, 316 (09)
  • [49] Evaluating a large language model in simulating different stages of depression and suicidal ideation in medical education
    Philipps, Annika
    Stegemann-Philipps, Christian
    Herrmann-Werner, Anne
    Festl-Wietek, Teresa
    Holderried, Friederike
    PSYCHOTHERAPY AND PSYCHOSOMATICS, 2024, 93 : 130 - 130
  • [50] Evaluating large language models on medical evidence summarization
    Tang, Liyan
    Sun, Zhaoyi
    Idnay, Betina
    Nestor, Jordan G.
    Soroush, Ali
    Elias, Pierre A.
    Xu, Ziyang
    Ding, Ying
    Durrett, Greg
    Rousseau, Justin F.
    Weng, Chunhua
    Peng, Yifan
    NPJ DIGITAL MEDICINE, 2023, 6 (01)