Evaluating large language models on medical evidence summarization

Cited by: 52

Authors:

Tang, Liyan [1]

Sun, Zhaoyi [2]

Idnay, Betina [3]

Nestor, Jordan G. [4]

Soroush, Ali [4]

Elias, Pierre A. [3]

Xu, Ziyang [5]

Ding, Ying [1]

Durrett, Greg [6]

Rousseau, Justin F. [7, 8, 9]

Weng, Chunhua [3]

Peng, Yifan [2]

Affiliations:

[1] Univ Texas Austin, Sch Informat, Austin, TX USA

[2] Weill Cornell Med, Dept Populat Hlth Sci, New York, NY 10065 USA

[3] Columbia Univ, Dept Biomed Informat, New York, NY 10027 USA

[4] Columbia Univ, Dept Med, New York, NY USA

[5] Massachusetts Gen Hosp, Dept Med, Boston, MA USA

[6] Univ Texas Austin, Dept Comp Sci, Austin, TX USA

[7] Univ Texas Austin, Dell Med Sch, Dept Populat Hlth, Austin, TX 78712 USA

[8] Univ Texas Austin, Dell Med Sch, Dept Neurol, Austin, TX 78712 USA

[9] Univ Texas Southwestern Med Ctr, Dept Neurol, Dallas, TX 75390 USA

Source:

NPJ DIGITAL MEDICINE | 2023 / Vol. 6 / Issue 01

Funding:

National Institutes of Health (USA); National Science Foundation (USA);

Keywords:

DOI:

10.1038/s41746-023-00896-7

Chinese Library Classification (CLC) number:

R19 [Health care organization and services (health services administration)];

Subject classification code:

Abstract:

Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study demonstrates that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs could be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts.


Pages: 8

50 records in total

[1] Evaluating large language models on medical evidence summarization
Liyan Tang
Zhaoyi Sun
Betina Idnay
Jordan G. Nestor
Ali Soroush
Pierre A. Elias
Ziyang Xu
Ying Ding
Greg Durrett
Justin F. Rousseau
Chunhua Weng
Yifan Peng
[J]. npj Digital Medicine, 2023, 6 (01)
[2] Closing the gap between open source and commercial large language models for medical evidence summarization
Zhang, Gongbo
Jin, Qiao
Zhou, Yiliang
Wang, Song
Idnay, Betina
Luo, Yiming
Park, Elizabeth
Nestor, Jordan G.
Spotnitz, Matthew E.
Soroush, Ali
Campion Jr, Thomas R.
Lu, Zhiyong
Weng, Chunhua
Peng, Yifan
[J]. NPJ DIGITAL MEDICINE, 2024, 7 (01):
[3] Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers
Subbiah, Melanie
Zhang, Sean
Chilton, Lydia B.
Mckeown, Kathleen
[J]. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2024, 12 : 1290 - 1310
[4] Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph
Nechakhin, Vladyslav
D'Souza, Jennifer
Eger, Steffen
[J]. INFORMATION, 2024, 15 (06)
[5] Benchmarking Large Language Models for News Summarization
Zhang, Tianyi
Ladhak, Faisal
Durmus, Esin
Liang, Percy
Mckeown, Kathleen
Hashimoto, Tatsunori B.
[J]. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2024, 12 : 39 - 57
[6] Adapted large language models can outperform medical experts in clinical text summarization
Van Veen, Dave
Van Uden, Cara
Blankemeier, Louis
Delbrouck, Jean-Benoit
Aali, Asad
Bluethgen, Christian
Pareek, Anuj
Polacin, Malgorzata
Reis, Eduardo Pontes
Seehofnerova, Anna
Rohatgi, Nidhi
Hosamani, Poonam
Collins, William
Ahuja, Neera
Langlotz, Curtis P.
Hom, Jason
Gatidis, Sergios
Pauly, John
Chaudhari, Akshay S.
[J]. NATURE MEDICINE, 2024, 30 (03) : 1134 - 1142
[7] MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models
Cai, Yan
Wang, Linlin
Wang, Ye
de Melo, Gerard
Zhang, Ya
Wang, Yanfeng
He, Liang
[J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 17709 - 17717
[8] Evaluating the Summarization Comprehension of Pre-Trained Language Models
D. I. Chernyshev
B. V. Dobrov
[J]. Lobachevskii Journal of Mathematics, 2023, 44 : 3028 - 3039
[9] Evaluating the Summarization Comprehension of Pre-Trained Language Models
Chernyshev, D. I.
Dobrov, B. V.
[J]. LOBACHEVSKII JOURNAL OF MATHEMATICS, 2023, 44 (08) : 3028 - 3039
[10] Evaluating the Diagnostic Performance of Large Language Models on Complex Multimodal Medical Cases
Chiu, Wan Hang Keith
Ko, Wei Sum Koel
Cho, William Chi Shing
Hui, Sin Yu Joanne
Chan, Wing Chi Lawrence
Kuo, Michael D.
[J]. JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
