Expert evaluation of large language models for clinical dialogue summarization

Cited: 0
Authors
Navarro, David Fraile [1]
Coiera, Enrico [1]
Hambly, Thomas W. [2]
Triplett, Zoe [3]
Asif, Nahyan [4]
Susanto, Anindya [1,5]
Chowdhury, Anamika [6]
Lorenzo, Amaya Azcoaga [7,8,9]
Dras, Mark [10]
Berkovsky, Shlomo [1]
Affiliations
[1] Macquarie Univ, Australian Inst Hlth Innovat, Ctr Hlth Informat, Level 6,75 Talavera Rd, Sydney, NSW 2113, Australia
[2] Univ Technol Sydney, Fac Engn & Informat Technol, Sydney, Australia
[3] Macquarie Univ, Fac Human & Hlth Sci, Sch Med, Sydney, Australia
[4] Macquarie Univ Hosp, Sydney, Australia
[5] Univ Indonesia, Fac Med, Jakarta, Indonesia
[6] Cowra Dist Hosp, Cowra, Australia
[7] Madrid Hlth Serv, Hlth Ctr Los Pintores, Madrid, Spain
[8] Fdn Jimenez Diaz, Hlth Res Inst, Madrid, Spain
[9] Univ St Andrews, St Andrews, Scotland
[10] Macquarie Univ, Sch Comp, Sydney, Australia
Source
SCIENTIFIC REPORTS | 2025, Vol. 15, No. 1
Keywords
Natural language processing; Electronic health records; Primary care; Artificial intelligence
DOI
10.1038/s41598-024-84850-x
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy, Earth Sciences]; Q [Biosciences]; N [General Natural Sciences]
Subject Classification Codes
07; 0710; 09
Abstract
We assessed the performance of large language models in summarizing clinical dialogues, using both computational metrics and human evaluation, and compared automatically generated summaries against human-produced ones. We conducted an exploratory evaluation of five language models: one general summarization model, one fine-tuned for general dialogues, two fine-tuned with anonymized clinical dialogues, and one Large Language Model (ChatGPT). The models were assessed with ROUGE and UniEval metrics, and by expert human evaluation in which clinicians compared the generated summaries against a clinician-generated summary (gold standard). The fine-tuned transformer model scored the highest when evaluated with ROUGE, while ChatGPT scored the lowest overall. With UniEval, however, ChatGPT scored the highest across all evaluated domains (coherence 0.957, consistency 0.7583, fluency 0.947, relevance 0.947, and overall score 0.9891). Similar results were obtained when the systems were evaluated by clinicians, with ChatGPT scoring the highest in four domains (coherence 0.573, consistency 0.908, fluency 0.96, and overall clinical use 0.862). Statistical analyses showed differences between ChatGPT and human summaries vs. all other models. These exploratory results indicate that ChatGPT's performance in summarizing clinical dialogues approached the quality of human summaries. They also suggest that ROUGE metrics may not be reliable for evaluating clinical summary generation, whereas UniEval correlated well with human ratings. Large language models may provide a successful path toward automating clinical dialogue summarization, although privacy concerns and the restricted nature of health records remain challenges for their integration. Further evaluations using diverse clinical dialogues and multiple initialization seeds are needed to verify the reliability and generalizability of automatically generated summaries.
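The evaluation pipeline summarized above can be illustrated with a short script: score each model summary against the clinician-written gold standard with ROUGE, then compare two models' per-dialogue scores with a paired Wilcoxon signed-rank test, broadly the kind of statistical comparison the abstract reports. This is a minimal sketch under stated assumptions, not the authors' code: it assumes the rouge-score and scipy Python packages, and all summaries and variable names are hypothetical placeholders.

    # Minimal illustrative sketch (not the study's actual pipeline).
    # Assumes: pip install rouge-score scipy
    from rouge_score import rouge_scorer
    from scipy.stats import wilcoxon

    # Hypothetical data: one clinician-written gold summary and one
    # machine-generated summary per dialogue, for two models.
    gold_summaries = [
        "Worsening cough for two weeks; plan chest X-ray, review in one week.",
        "Blood pressure controlled on current dose; continue medication.",
        "New-onset knee pain; advised rest, ice, and ibuprofen as needed.",
    ]
    model_a = [
        "Cough worsening over two weeks; chest X-ray ordered, review in a week.",
        "BP controlled; continue current medication.",
        "Knee pain started recently; rest, ice, ibuprofen recommended.",
    ]
    model_b = [
        "The patient came in today.",
        "Medication was discussed.",
        "The knee was examined.",
    ]

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)

    def rouge_l_f1(references, candidates):
        """Per-dialogue ROUGE-L F1 of each candidate vs. its reference."""
        return [scorer.score(ref, cand)["rougeL"].fmeasure
                for ref, cand in zip(references, candidates)]

    scores_a = rouge_l_f1(gold_summaries, model_a)
    scores_b = rouge_l_f1(gold_summaries, model_b)

    # Paired non-parametric test over per-dialogue scores; with only three
    # dialogues this is purely illustrative -- a real study needs many more.
    stat, p_value = wilcoxon(scores_a, scores_b)
    print(f"mean ROUGE-L: A={sum(scores_a)/len(scores_a):.3f}, "
          f"B={sum(scores_b)/len(scores_b):.3f}, Wilcoxon p={p_value:.3f}")

UniEval scoring would follow the same per-dialogue pattern but replaces n-gram overlap with a learned multi-dimensional evaluator (coherence, consistency, fluency, relevance), consistent with the abstract's observation that it tracks clinician ratings more closely than ROUGE.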
Pages: 11
Related Papers
50 records in total
  • [21] Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation
    Wysocka, Magdalena
    Wysocki, Oskar
    Delmas, Maxime
    Mutel, Vincent
    Freitas, Andre
    JOURNAL OF BIOMEDICAL INFORMATICS, 2024, 158
  • [22] Large language models in medical ethics: useful but not expert
    Ferrario, Andrea
    Biller-Andorno, Nikola
    JOURNAL OF MEDICAL ETHICS, 2024, 50 (09) : 653 - 654
  • [23] Text Summarization in Aviation Safety: A Comparative Study of Large Language Models
    Emmons, Jonathan
    Sharma, Taneesha
    Salloum, Mariam
    Matthews, Bryan
AIAA AVIATION FORUM AND ASCEND 2024, 2024
  • [24] The Effect of Prompt Types on Text Summarization Performance With Large Language Models
    Borhan, Iffat
    Bajaj, Akhilesh
    JOURNAL OF DATABASE MANAGEMENT, 2024, 35 (01)
  • [25] On the Effectiveness of Large Language Models in Statement-level Code Summarization
    Zhu, Jie
    Miao, Yun
    Xu, Tingting
    Zhu, Junwu
    Sun, Xiaolei
    2024 IEEE 24TH INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY, QRS, 2024, : 216 - 227
  • [26] Evaluating the Factual Consistency of Large Language Models Through News Summarization
    Tam, Derek
    Mascarenhas, Anisha
    Zhang, Shiyue
    Kwan, Sarah
    Bansal, Mohit
    Raffel, Colin
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 5220 - 5255
  • [27] Expert Evaluation of a Spoken Dialogue System in a Clinical Operating Room
    Miehle, Juliana
    Gerstenlauer, Nadine
    Ostler, Daniel
    Feussner, Hubertus
    Minker, Wolfgang
    Ultes, Stefan
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 735 - 740
  • [28] Leveraging large language models for abstractive summarization of Italian legal news
    Benedetto, Irene
    Cagliero, Luca
    Ferro, Michele
    Tarasconi, Francesco
    Bernini, Claudia
    Giacalone, Giuseppe
ARTIFICIAL INTELLIGENCE AND LAW, 2025
  • [29] Assessing the Impact of Prompt Strategies on Text Summarization with Large Language Models
    Onan, Aytug
    Alhumyani, Hesham
    COMPUTER APPLICATIONS IN INDUSTRY AND ENGINEERING, CAINE 2024, 2025, 2242 : 41 - 55
  • [30] Empirical Analysis of Dialogue Relation Extraction with Large Language Models
    Li, Guozheng
    Xu, Zijie
    Shang, Ziyu
    Liu, Jiajun
    Ji, Ke
    Guo, Yikai
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 6359 - 6367