Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis

Cited by: 28
Authors
Song, Haifeng [1 ,2 ]
Xia, Yi [3 ,4 ]
Luo, Zhichao [1 ,2 ]
Liu, Hui [1 ,2 ]
Song, Yan [5 ]
Zeng, Xue [1 ,2 ]
Li, Tianjie [1 ,2 ]
Zhong, Guangxin [1 ,2 ]
Li, Jianxing [1 ,2 ]
Chen, Ming [3 ]
Zhang, Guangyuan [3 ]
Xiao, Bo [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Beijing Tsinghua Changgung Hosp, Sch Clin Med, Dept Urol, 168 Litang Rd, Beijing 102218, Peoples R China
[2] Tsinghua Univ, Inst Urol, Sch Clin Med, Beijing 102218, Peoples R China
[3] Southeast Univ, Zhongda Hosp, Dept Urol, 87 Dingjiaqiao, Nanjing 210009, Peoples R China
[4] Southeast Univ, Sch Med, Nanjing 210009, Peoples R China
[5] China Med Univ, Urol Dept, Sheng Jing Hosp, Shenyang 110000, Peoples R China
Keywords
Urolithiasis; Health consultation; Large language model; ChatGPT; Artificial intelligence;
DOI
10.1007/s10916-023-02021-3
Chinese Library Classification
R19 [Health care organization and services (health service management)];
Abstract
Objectives: To evaluate the effectiveness of four large language models (LLMs) with large user bases and significant social attention (Claude, Bard, ChatGPT4, and New Bing) in the context of medical consultation and patient education in urolithiasis.

Materials and methods: We developed a questionnaire consisting of 21 questions and 2 clinical scenarios related to urolithiasis. Clinical consultations were then simulated with each of the four models to assess their responses to the questions. Urolithiasis experts evaluated the responses for accuracy, comprehensiveness, ease of understanding, human care, and clinical case analysis ability on a predesigned 5-point Likert scale. Visualization and statistical analyses were then employed to compare the four models and evaluate their performance.

Results: All models performed satisfactorily except Bard, which failed to provide a valid response to Question 13. Claude consistently scored the highest in all dimensions compared with the other three models. ChatGPT4 ranked second in accuracy, with relatively stable output across multiple tests, but showed shortcomings in empathy and human caring. Bard exhibited the lowest accuracy and overall performance. Both Claude and ChatGPT4 showed a high capacity to analyze clinical cases of urolithiasis. Overall, Claude emerged as the best performer in urolithiasis consultations and education.

Conclusion: Claude demonstrated superior performance compared with the other three models in urolithiasis consultation and education. This study highlights the remarkable potential of LLMs in medical health consultation and patient education, although professional review, further evaluation, and modification are still required.
Pages: 9
Related papers (50 items in total)
  • [31] Evaluating Large Language Models for Material Selection
    Grandi, Daniele
    Jain, Yash Patawari
    Groom, Allin
    Cramer, Brandon
    Mccomb, Christopher
    JOURNAL OF COMPUTING AND INFORMATION SCIENCE IN ENGINEERING, 2025, 25 (02)
  • [32] Evaluating large language models in pediatric nephrology
    Filler, Guido
    Niel, Olivier
    PEDIATRIC NEPHROLOGY, 2025,
  • [33] Evaluating large language models as agents in the clinic
    Mehandru, Nikita
    Miao, Brenda Y.
    Almaraz, Eduardo Rodriguez
    Sushil, Madhumita
    Butte, Atul J.
    Alaa, Ahmed
    NPJ DIGITAL MEDICINE, 2024, 7 (01)
  • [34] EVALUATING LARGE LANGUAGE MODELS ON THEIR ACCURACY AND COMPLETENESS
    Edalat, Camellia
    Kirupaharan, Nila
    Dalvin, Lauren A.
    Mishra, Kapil
    Marshall, Rayna
    Xu, Hannah
    Francis, Jasmine H.
    Berkenstock, Meghan
    RETINA-THE JOURNAL OF RETINAL AND VITREOUS DISEASES, 2025, 45 (01): : 128 - 132
  • [35] From Search Engines to Large Language Models: A Big Leap for Patient Education!
    Barabino, Emanuele
    Cittadini, Giuseppe
    CARDIOVASCULAR AND INTERVENTIONAL RADIOLOGY, 2024, 47 (02) : 251 - 252
  • [36] Advancing Patient Education in Idiopathic Intracranial Hypertension: The Promise of Large Language Models
    Dihan, Qais A.
    Brown, Andrew D.
    Zaldivar, Ana T.
    Chauhan, Muhammad Z.
    Eleiwa, Taher K.
    Hassan, Amr K.
    Solyman, Omar
    Gise, Ryan
    Phillips, Paul H.
    Sallam, Ahmed B.
    Elhusseiny, Abdelrahman M.
    NEUROLOGY-CLINICAL PRACTICE, 2025, 15 (01)
  • [38] Evaluating Intelligence and Knowledge in Large Language Models
    Bianchini, Francesco
    TOPOI-AN INTERNATIONAL REVIEW OF PHILOSOPHY, 2025, 44 (01): : 163 - 173
  • [39] Evaluating large language models for software testing
    Li, Yihao
    Liu, Pan
    Wang, Haiyang
    Chu, Jie
    Wong, W. Eric
    COMPUTER STANDARDS & INTERFACES, 2025, 93