Expert evaluation of large language models for clinical dialogue summarization

Cited by: 0
Authors
Navarro, David Fraile [1]
Coiera, Enrico [1]
Hambly, Thomas W. [2]
Triplett, Zoe [3]
Asif, Nahyan [4]
Susanto, Anindya [1,5]
Chowdhury, Anamika [6]
Lorenzo, Amaya Azcoaga [7,8,9]
Dras, Mark [10]
Berkovsky, Shlomo [1]
Affiliations
[1] Macquarie Univ, Australian Inst Hlth Innovat, Ctr Hlth Informat, Level 6, 75 Talavera Rd, Sydney, NSW 2113, Australia
[2] Univ Technol Sydney, Fac Engn & Informat Technol, Sydney, Australia
[3] Macquarie Univ, Fac Human & Hlth Sci, Sch Med, Sydney, Australia
[4] Macquarie Univ Hosp, Sydney, Australia
[5] Univ Indonesia, Fac Med, Jakarta, Indonesia
[6] Cowra Dist Hosp, Cowra, Australia
[7] Madrid Hlth Serv, Hlth Ctr Los Pintores, Madrid, Spain
[8] Fdn Jimenez Diaz, Hlth Res Inst, Madrid, Spain
[9] Univ St Andrews, St Andrews, Scotland
[10] Macquarie Univ, Sch Comp, Sydney, Australia
Source
SCIENTIFIC REPORTS | 2025, Vol. 15, No. 1
Keywords
Natural language processing; Electronic health records; Primary care; Artificial intelligence
DOI
10.1038/s41598-024-84850-x
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject Classification Codes
07; 0710; 09
Abstract
We assessed the performance of large language models in summarizing clinical dialogues, using computational metrics and human evaluations, and compared automatically generated summaries against human-produced ones. We conducted an exploratory evaluation of five language models: one general summarization model, one fine-tuned for general dialogues, two fine-tuned with anonymized clinical dialogues, and one Large Language Model (ChatGPT). The models were assessed using ROUGE and UniEval metrics, as well as expert human evaluation, in which clinicians compared the generated summaries against a clinician-generated summary (gold standard). The fine-tuned transformer model scored highest when evaluated with ROUGE, while ChatGPT scored lowest overall. With UniEval, however, ChatGPT scored highest across all evaluated domains (coherence 0.957, consistency 0.7583, fluency 0.947, relevance 0.947, and overall score 0.9891). Similar results were obtained when the systems were evaluated by clinicians, with ChatGPT scoring highest in four domains (coherence 0.573, consistency 0.908, fluency 0.96, and overall clinical use 0.862). Statistical analyses showed differences between ChatGPT and human summaries versus all other models. These exploratory results indicate that ChatGPT's performance in summarizing clinical dialogues approached the quality of human summaries. The study also found that ROUGE metrics may not be reliable for evaluating clinical summary generation, whereas UniEval correlated well with human ratings. Large language models may provide a successful path for automating clinical dialogue summarization, although privacy concerns and the restricted nature of health records remain challenges for their integration. Further evaluations using diverse clinical dialogues and multiple initialization seeds are needed to verify the reliability and generalizability of automatically generated summaries.
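The abstract does not specify the tooling used for the ROUGE evaluation; as a minimal illustration, lexical-overlap scores of the kind reported above can be computed with the open-source rouge-score Python package. The example texts below are hypothetical stand-ins, not data from the study, and the UniEval scoring step (which requires the separately released UniEval checkpoints) is omitted.

```python
# Minimal sketch of ROUGE scoring for summary evaluation.
# Assumes the rouge-score package: pip install rouge-score
from rouge_score import rouge_scorer

# Hypothetical texts standing in for a clinician-written gold-standard
# summary and a model-generated summary; not taken from the study.
reference = "Patient reports a worsening cough for two weeks; chest X-ray ordered."
candidate = "The patient has had a worsening cough for two weeks and will get a chest X-ray."

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence),
# the variants most commonly reported for summarization.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    print(f"{name}: precision={result.precision:.3f} "
          f"recall={result.recall:.3f} f1={result.fmeasure:.3f}")
```

As the abstract notes, such n-gram overlap metrics can diverge sharply from human judgments of clinical summaries, which is why the study pairs them with UniEval and expert clinician ratings.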
Pages: 11