Expert evaluation of large language models for clinical dialogue summarization

Times cited: 0
Authors
Navarro, David Fraile [1 ]
Coiera, Enrico [1 ]
Hambly, Thomas W. [2 ]
Triplett, Zoe [3 ]
Asif, Nahyan [4 ]
Susanto, Anindya [1 ,5 ]
Chowdhury, Anamika [6 ]
Lorenzo, Amaya Azcoaga [7 ,8 ,9 ]
Dras, Mark [10 ]
Berkovsky, Shlomo [1 ]
Affiliations
[1] Macquarie Univ, Australian Inst Hlth Innovat, Ctr Hlth Informat, Level 6,75 Talavera Rd, Sydney, NSW 2113, Australia
[2] Univ Technol Sydney, Fac Engn & Informat Technol, Sydney, Australia
[3] Macquarie Univ, Fac Human & Hlth Sci, Sch Med, Sydney, Australia
[4] Macquarie Univ Hosp, Sydney, Australia
[5] Univ Indonesia, Fac Med, Jakarta, Indonesia
[6] Cowra Dist Hosp, Cowra, Australia
[7] Madrid Hlth Serv, Hlth Ctr Los Pintores, Madrid, Spain
[8] Fdn Jimenez Diaz, Hlth Res Inst, Madrid, Spain
[9] Univ St Andrews, St Andrews, Scotland
[10] Macquarie Univ, Sch Comp, Sydney, Australia
Source
SCIENTIFIC REPORTS | 2025, Vol. 15, No. 1
Keywords
Natural language processing; Electronic health records; Primary care; Artificial intelligence
DOI
10.1038/s41598-024-84850-x
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biosciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09
Abstract
We assessed the performance of large language models in summarizing clinical dialogues, using both computational metrics and human evaluation, and compared automatically generated summaries against human-produced ones. We conducted an exploratory evaluation of five language models: one general summarization model, one fine-tuned for general dialogues, two fine-tuned with anonymized clinical dialogues, and one large language model (ChatGPT). The models were assessed with ROUGE and UniEval metrics, and by expert human evaluation in which clinicians compared the generated summaries against a clinician-generated summary (gold standard). The fine-tuned transformer model scored highest when evaluated with ROUGE, while ChatGPT scored lowest overall. With UniEval, however, ChatGPT scored highest across all evaluated domains (coherence 0.957, consistency 0.7583, fluency 0.947, relevance 0.947, and overall score 0.9891). Similar results were obtained when the systems were evaluated by clinicians, with ChatGPT scoring highest in four domains (coherence 0.573, consistency 0.908, fluency 0.96, and overall clinical use 0.862). Statistical analyses showed differences between ChatGPT and human summaries versus all other models. These exploratory results indicate that ChatGPT's performance in summarizing clinical dialogues approached the quality of human summaries. The study also found that ROUGE metrics may not be reliable for evaluating clinical summary generation, whereas UniEval correlated well with human ratings. Large language models may provide a viable path toward automating clinical dialogue summarization, although privacy concerns and the restricted nature of health records remain challenges for their integration. Further evaluations using diverse clinical dialogues and multiple initialization seeds are needed to verify the reliability and generalizability of automatically generated summaries.
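As a rough illustration of the metric-based part of the evaluation described in the abstract, the following Python sketch scores candidate summaries against a clinician-written gold standard with ROUGE and then checks how the metric tracks clinician ratings. It is a minimal sketch, assuming the `rouge-score` and `scipy` packages; the gold summary, the three model outputs, and the clinician ratings are invented placeholders, not data from the study.

# Minimal sketch: ROUGE scoring against a gold summary, then a rank
# correlation with clinician ratings. All strings and ratings below are
# hypothetical placeholders, not values from the paper.
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

# Clinician-written reference summary (gold standard) and candidate summaries.
gold = "Two weeks of worsening cough; no fever; plan: chest X-ray and review."
candidates = {
    "general-summarizer": "The patient talked about a cough and other things.",
    "fine-tuned-clinical": "Worsening cough for two weeks, no fever; chest X-ray and review planned.",
    "chatgpt": "The patient reports a two-week worsening cough without fever. A chest X-ray was arranged, with follow-up review.",
}
# Hypothetical overall clinician ratings on a 0-1 scale, as in the paper.
human_ratings = {"general-summarizer": 0.40, "fine-tuned-clinical": 0.75, "chatgpt": 0.86}

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge_l = {}
for name, summary in candidates.items():
    scores = scorer.score(gold, summary)       # reference first, candidate second
    rouge_l[name] = scores["rougeL"].fmeasure  # F1 over the longest common subsequence
    print(name, {k: round(v.fmeasure, 3) for k, v in scores.items()})

# How well does the automatic metric track clinician judgement? The study runs
# this kind of comparison for both ROUGE and UniEval against human ratings.
names = sorted(candidates)
rho, _ = spearmanr([rouge_l[n] for n in names], [human_ratings[n] for n in names])
print(f"Spearman correlation between ROUGE-L and clinician ratings: {rho:.2f}")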
Pages: 11