Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis

Cited by: 28
Authors
Song, Haifeng [1 ,2 ]
Xia, Yi [3 ,4 ]
Luo, Zhichao [1 ,2 ]
Liu, Hui [1 ,2 ]
Song, Yan [5 ]
Zeng, Xue [1 ,2 ]
Li, Tianjie [1 ,2 ]
Zhong, Guangxin [1 ,2 ]
Li, Jianxing [1 ,2 ]
Chen, Ming [3 ]
Zhang, Guangyuan [3 ]
Xiao, Bo [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Beijing Tsinghua Changgung Hosp, Sch Clin Med, Dept Urol, 168 Litang Rd, Beijing 102218, Peoples R China
[2] Tsinghua Univ, Inst Urol, Sch Clin Med, Beijing 102218, Peoples R China
[3] Southeast Univ, Zhongda Hosp, Dept Urol, 87 Dingjiaqiao, Nanjing 210009, Peoples R China
[4] Southeast Univ, Sch Med, Nanjing 210009, Peoples R China
[5] China Med Univ, Urol Dept, Sheng Jing Hosp, Shenyang 110000, Peoples R China
Keywords
Urolithiasis; Health consultation; Large language model; ChatGPT; Artificial intelligence;
DOI
10.1007/s10916-023-02021-3
Chinese Library Classification (CLC)
R19 [Health Care Organization and Services (Health Services Management)];
Subject Classification Code
Abstract
Objectives: To evaluate the effectiveness of four large language models (LLMs) (Claude, Bard, ChatGPT4, and New Bing) that have large user bases and significant social attention, in the context of medical consultation and patient education in urolithiasis.
Materials and methods: In this study, we developed a questionnaire consisting of 21 questions and 2 clinical scenarios related to urolithiasis. Clinical consultations were then simulated with each of the four models to obtain their responses to the questions. Urolithiasis experts evaluated the model responses in terms of accuracy, comprehensiveness, ease of understanding, human care, and clinical case analysis ability using a predesigned 5-point Likert scale. Visualization and statistical analyses were then employed to compare the four models and evaluate their performance.
Results: All models yielded satisfactory performance, except for Bard, which failed to provide a valid response to Question 13. Claude consistently scored the highest in all dimensions compared with the other three models. ChatGPT4 ranked second in accuracy, with relatively stable output across multiple tests, but showed shortcomings in empathy and human care. Bard exhibited the lowest accuracy and overall performance. Claude and ChatGPT4 both had a high capacity to analyze clinical cases of urolithiasis. Overall, Claude emerged as the best performer in urolithiasis consultation and education.
Conclusion: Claude demonstrated superior performance compared with the other three models in urolithiasis consultation and education. This study highlights the remarkable potential of LLMs in medical health consultations and patient education, although professional review, further evaluation, and modifications are still required.
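For illustration, a minimal Python sketch of the kind of comparison implied by the Materials and methods: per-question 5-point Likert ratings for one evaluation dimension are summarized per model and compared across the four models. The rating values are hypothetical, and the Kruskal-Wallis test is an assumed choice of non-parametric test; the abstract only states that visualization and statistical analyses were employed.

```python
# A minimal sketch (not the authors' code) of comparing expert Likert ratings
# across the four models. All rating values below are hypothetical, and the
# Kruskal-Wallis test is an assumed analysis choice; the paper's abstract only
# states that "visualization and statistical analyses" were used.
import numpy as np
from scipy.stats import kruskal

# Hypothetical 5-point Likert ratings, one value per question (21 questions),
# for a single dimension (e.g., accuracy) of each model. The 0 at Question 13
# for Bard is an illustrative placeholder for its invalid response.
ratings = {
    "Claude":   [5, 5, 4, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 4, 5, 5, 5, 4, 5, 5, 4],
    "ChatGPT4": [5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4],
    "New Bing": [4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3],
    "Bard":     [3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 0, 4, 3, 3, 4, 3, 3, 4, 3],
}

# Descriptive comparison: mean score per model for this dimension.
for model, scores in ratings.items():
    print(f"{model:9s} mean score = {np.mean(scores):.2f}")

# Kruskal-Wallis H-test: do the four rating distributions differ significantly?
h_stat, p_value = kruskal(*ratings.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}")
```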
Pages: 9
Related Papers
50 records in total
  • [1] Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis
    Song, Haifeng
    Xia, Yi
    Song, Yan
    Li, Jianxing
    Zhang, Guangyuan
    Xiao, Bo
    JOURNAL OF UROLOGY, 2024, 211 (05): E391 - E392
  • [2] Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis
    Haifeng Song
    Yi Xia
    Zhichao Luo
    Hui Liu
    Yan Song
    Xue Zeng
    Tianjie Li
    Guangxin Zhong
    Jianxing Li
    Ming Chen
    Guangyuan Zhang
    Bo Xiao
    Journal of Medical Systems, 47
  • [3] Evaluating the effectiveness of large language models in patient education for conjunctivitis
    Wang, Jingyuan
    Shi, Runhan
    Le, Qihua
    Shan, Kun
    Chen, Zhi
    Zhou, Xujiao
    He, Yao
    Hong, Jiaxu
    BRITISH JOURNAL OF OPHTHALMOLOGY, 2024,
  • [4] Evaluating large language models as patient education tools for inflammatory bowel disease: A comparative study
    Zhang, Yan
    Wan, Xiao-Han
    Kong, Qing-Zhou
    Liu, Han
    Liu, Jun
    Guo, Jing
    Yang, Xiao-Yun
    Zuo, Xiu-Li
    Li, Yan-Qing
    WORLD JOURNAL OF GASTROENTEROLOGY, 2025, 31 (06)
  • [5] Performance Assessment of Large Language Models in Medical Consultation: Comparative Study
    Seo, Sujeong
    Kim, Kyuli
    Yang, Heyoung
    JMIR MEDICAL INFORMATICS, 2025, 13
  • [6] More Is Different: Large Language Models in Health Care
    Lungren, Matthew P.
    Fishman, Elliot K.
    Chu, Linda C.
    Rizk, Ryan C.
    Rowe, Steven P.
    JOURNAL OF THE AMERICAN COLLEGE OF RADIOLOGY, 2024, 21 (07) : 1151 - 1154
  • [7] Evaluating the Performance of Large Language Models for Spanish Language in Undergraduate Admissions Exams
    Miranda, Sabino
    Pichardo-Lagunas, Obdulia
    Martinez-Seis, Bella
    Baldi, Pierre
    COMPUTACION Y SISTEMAS, 2023, 27 (04): 1241 - 1248
  • [8] Tailoring glaucoma education using large language models: Addressing health disparities in patient comprehension
    Spina, Aidin C.
    Fereydouni, Pirooz
    Tang, Jordan N.
    Andalib, Saman
    Picton, Bryce G.
    Fox, Austin R.
    MEDICINE, 2025, 104 (02)
  • [9] Evaluating the Application of Large Language Models to Generate Feedback in Programming Education
    Jacobs, Sven
    Jaschke, Steffen
    2024 IEEE GLOBAL ENGINEERING EDUCATION CONFERENCE, EDUCON 2024, 2024,
  • [10] Evaluating the performance of Large Language Models in responding to patients' health queries: A comparative analysis with medical experts
    Yan, Z.
    Lu, S.
    Xu, D.
    Yang, Y.
    Wang, H.
    Mao, J.
    Fan, Y.
    Chen, Y.
    Tseng, H. C.
    JOURNAL OF CROHNS & COLITIS, 2024, 18 : I1338 - I1339