Evaluating large language models on medical evidence summarization

Cited by: 52
Authors
Tang, Liyan [1]
Sun, Zhaoyi [2]
Idnay, Betina [3]
Nestor, Jordan G. [4]
Soroush, Ali [4]
Elias, Pierre A. [3]
Xu, Ziyang [5]
Ding, Ying [1]
Durrett, Greg [6]
Rousseau, Justin F. [7,8,9]
Weng, Chunhua [3]
Peng, Yifan [2]
Affiliations
[1] Univ Texas Austin, Sch Informat, Austin, TX USA
[2] Weill Cornell Med, Dept Populat Hlth Sci, New York, NY 10065 USA
[3] Columbia Univ, Dept Biomed Informat, New York, NY 10027 USA
[4] Columbia Univ, Dept Med, New York, NY USA
[5] Massachusetts Gen Hosp, Dept Med, Boston, MA USA
[6] Univ Texas Austin, Dept Comp Sci, Austin, TX USA
[7] Univ Texas Austin, Dell Med Sch, Dept Populat Hlth, Austin, TX 78712 USA
[8] Univ Texas Austin, Dell Med Sch, Dept Neurol, Austin, TX 78712 USA
[9] Univ Texas Southwestern Med Ctr, Dept Neurol, Dallas, TX 75390 USA
Funding
US National Institutes of Health; US National Science Foundation
Keywords
DOI
10.1038/s41746-023-00896-7
Chinese Library Classification (CLC)
R19 [Health care organization and services (health administration)]
Subject classification code
Abstract
Recent advances in large language models (LLMs) have demonstrated remarkable zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study demonstrates that automatic metrics often do not correlate strongly with the quality of summaries. Furthermore, informed by our human evaluations, we define a taxonomy of error types for medical evidence summarization. Our findings reveal that LLMs can generate factually inconsistent summaries and make overly convincing or overly uncertain statements, leading to potential harm from misinformation. Moreover, we find that the models struggle to identify salient information and become more error-prone when summarizing longer textual contexts.
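As an illustration of the metric-versus-human comparison described in the abstract, the following minimal Python sketch scores model summaries against reference summaries with ROUGE-L and correlates the scores with human quality ratings. This is not the study's code: the example pairs, the 1-5 ratings, and the choice of ROUGE-L with a Spearman test are illustrative assumptions, using the publicly available rouge-score and scipy packages.

# Illustrative sketch (not the paper's code): score model summaries with
# ROUGE-L and check how well the metric tracks human quality ratings.
# All example data below is placeholder, not taken from the study.
from rouge_score import rouge_scorer  # pip install rouge-score
from scipy.stats import spearmanr     # pip install scipy

# (reference summary, model summary, hypothetical human rating on 1-5)
examples = [
    ("The drug reduced mortality in adults.",
     "The drug lowered death rates in adult patients.", 5),
    ("Evidence for the intervention is inconclusive.",
     "The intervention is clearly effective.", 1),   # fluent but unfaithful
    ("No benefit was observed for the primary outcome.",
     "No benefit was seen for the primary outcome.", 4),
    ("Adverse events were more frequent in the treatment arm.",
     "The treatment was generally safe.", 2),        # overly convincing
]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
metric = [scorer.score(ref, hyp)["rougeL"].fmeasure for ref, hyp, _ in examples]
human = [rating for _, _, rating in examples]

rho, p = spearmanr(metric, human)
print("ROUGE-L F1 per example:", [round(s, 3) for s in metric])
print(f"Spearman rho vs. human ratings: {rho:.2f} (p = {p:.2f})")

The sketch reproduces the failure mode the abstract describes: a summary can share many n-grams with its reference yet contradict it (the second and fourth examples), so a respectable ROUGE score need not indicate a faithful summary, which is why the study pairs automatic metrics with human evaluation.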
Pages: 8
Related papers (50 in total)
  • [1] Tang, Liyan; Sun, Zhaoyi; Idnay, Betina; Nestor, Jordan G.; Soroush, Ali; Elias, Pierre A.; Xu, Ziyang; Ding, Ying; Durrett, Greg; Rousseau, Justin F.; Weng, Chunhua; Peng, Yifan. Evaluating large language models on medical evidence summarization. [J]. npj Digital Medicine, 2023, 6.
  • [2] Zhang, Gongbo; Jin, Qiao; Zhou, Yiliang; Wang, Song; Idnay, Betina; Luo, Yiming; Park, Elizabeth; Nestor, Jordan G.; Spotnitz, Matthew E.; Soroush, Ali; Campion, Thomas R., Jr.; Lu, Zhiyong; Weng, Chunhua; Peng, Yifan. Closing the gap between open source and commercial large language models for medical evidence summarization. [J]. npj Digital Medicine, 2024, 7(1).
  • [3] Subbiah, Melanie; Zhang, Sean; Chilton, Lydia B.; McKeown, Kathleen. Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers. [J]. Transactions of the Association for Computational Linguistics, 2024, 12: 1290-1310.
  • [4] Nechakhin, Vladyslav; D'Souza, Jennifer; Eger, Steffen. Evaluating Large Language Models for Structured Science Summarization in the Open Research Knowledge Graph. [J]. Information, 2024, 15(6).
  • [5] Zhang, Tianyi; Ladhak, Faisal; Durmus, Esin; Liang, Percy; McKeown, Kathleen; Hashimoto, Tatsunori B. Benchmarking Large Language Models for News Summarization. [J]. Transactions of the Association for Computational Linguistics, 2024, 12: 39-57.
  • [6] Van Veen, Dave; Van Uden, Cara; Blankemeier, Louis; Delbrouck, Jean-Benoit; Aali, Asad; Bluethgen, Christian; Pareek, Anuj; Polacin, Malgorzata; Reis, Eduardo Pontes; Seehofnerova, Anna; Rohatgi, Nidhi; Hosamani, Poonam; Collins, William; Ahuja, Neera; Langlotz, Curtis P.; Hom, Jason; Gatidis, Sergios; Pauly, John; Chaudhari, Akshay S. Adapted large language models can outperform medical experts in clinical text summarization. [J]. Nature Medicine, 2024, 30(3): 1134-1142.
  • [7] Cai, Yan; Wang, Linlin; Wang, Ye; de Melo, Gerard; Zhang, Ya; Wang, Yanfeng; He, Liang. MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models. Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024: 17709-17717.
  • [8] Chernyshev, D. I.; Dobrov, B. V. Evaluating the Summarization Comprehension of Pre-Trained Language Models. [J]. Lobachevskii Journal of Mathematics, 2023, 44(8): 3028-3039.
  • [9] Chiu, Wan Hang Keith; Ko, Wei Sum Koel; Cho, William Chi Shing; Hui, Sin Yu Joanne; Chan, Wing Chi Lawrence; Kuo, Michael D. Evaluating the Diagnostic Performance of Large Language Models on Complex Multimodal Medical Cases. [J]. Journal of Medical Internet Research, 2024, 26.