Expert evaluation of large language models for clinical dialogue summarization

Times cited: 0
Authors
Navarro, David Fraile [1 ]
Coiera, Enrico [1 ]
Hambly, Thomas W. [2 ]
Triplett, Zoe [3 ]
Asif, Nahyan [4 ]
Susanto, Anindya [1 ,5 ]
Chowdhury, Anamika [6 ]
Lorenzo, Amaya Azcoaga [7 ,8 ,9 ]
Dras, Mark [10 ]
Berkovsky, Shlomo [1 ]
Affiliations
[1] Macquarie Univ, Australian Inst Hlth Innovat, Ctr Hlth Informat, Level 6,75 Talavera Rd, Sydney, NSW 2113, Australia
[2] Univ Technol Sydney, Fac Engn & Informat Technol, Sydney, Australia
[3] Macquarie Univ, Fac Human & Hlth Sci, Sch Med, Sydney, Australia
[4] Macquarie Univ Hosp, Sydney, Australia
[5] Univ Indonesia, Fac Med, Jakarta, Indonesia
[6] Cowra Dist Hosp, Cowra, Australia
[7] Madrid Hlth Serv, Hlth Ctr Los Pintores, Madrid, Spain
[8] Fdn Jimenez Diaz, Hlth Res Inst, Madrid, Spain
[9] Univ St Andrews, St Andrews, Scotland
[10] Macquarie Univ, Sch Comp, Sydney, Australia
Source
SCIENTIFIC REPORTS | 2025, Vol. 15, No. 1
Keywords
Natural language processing; Electronic health records; Primary care; Artificial intelligence
DOI
10.1038/s41598-024-84850-x
Chinese Library Classification: O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [Natural sciences, general]
Discipline codes: 07; 0710; 09
Abstract
We assessed the performance of large language models in summarizing clinical dialogues using computational metrics and human evaluations, comparing automatically generated summaries against human-produced ones. We conducted an exploratory evaluation of five language models: one general summarization model, one fine-tuned for general dialogues, two fine-tuned with anonymized clinical dialogues, and one large language model (ChatGPT). These models were assessed using the ROUGE and UniEval metrics, and by expert human evaluation in which clinicians compared the generated summaries against a clinician-generated summary (gold standard). The fine-tuned transformer model scored highest when evaluated with ROUGE, while ChatGPT scored lowest overall. However, with UniEval, ChatGPT scored highest across all evaluated domains (coherence 0.957, consistency 0.7583, fluency 0.947, relevance 0.947, and overall score 0.9891). Similar results were obtained when the systems were evaluated by clinicians, with ChatGPT scoring highest in four domains (coherence 0.573, consistency 0.908, fluency 0.96, and overall clinical use 0.862). Statistical analyses showed differences between ChatGPT and human summaries versus all other models. These exploratory results indicate that ChatGPT's performance in summarizing clinical dialogues approached the quality of human summaries. The study also found that the ROUGE metrics may not be reliable for evaluating clinical summary generation, whereas UniEval correlated well with human ratings. Large language models may provide a viable path toward automating clinical dialogue summarization, although privacy concerns and the restricted nature of health records remain challenges for their integration. Further evaluations using diverse clinical dialogues and multiple initialization seeds are needed to verify the reliability and generalizability of automatically generated summaries.
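As context for the ROUGE comparison described in the abstract, the sketch below illustrates how a ROUGE-1 score measures unigram overlap between a generated summary and a reference. This is a minimal pure-Python illustration of the metric's idea, not the paper's evaluation code, and the example texts are invented for demonstration:

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """Compute ROUGE-1 precision, recall, and F1 from unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection: each shared word counts up to its minimum frequency.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical example: clinician reference summary vs. model output.
reference = "patient reports chest pain for two days"
generated = "patient has chest pain for two days"
scores = rouge1(generated, reference)  # 6 of 7 unigrams overlap
```

Because ROUGE rewards surface word overlap rather than meaning, a fluent paraphrase can score poorly, which is consistent with the study's finding that ROUGE may be unreliable for clinical summary evaluation while learned metrics such as UniEval track human ratings more closely.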
Pages: 11