Evaluation and mitigation of the limitations of large language models in clinical decision-making

Cited by: 43
Authors
Hager, Paul [1, 2]
Jungmann, Friederike [1, 2]
Holland, Robbie [3]
Bhagat, Kunal [4]
Hubrecht, Inga [5]
Knauer, Manuel [5]
Vielhauer, Jakob [6]
Makowski, Marcus [2]
Braren, Rickmer [2]
Kaissis, Georgios [1, 2, 3, 7]
Rueckert, Daniel [1, 3]
Affiliations
[1] Tech Univ Munich, Klinikum Rechts Isar, Inst AI & Informat, Munich, Germany
[2] Tech Univ Munich, Inst Diagnost & Intervent Radiol, Klinikum Rechts Isar, Munich, Germany
[3] Imperial Coll, Dept Comp, London, England
[4] ChristianaCare Hlth Syst, Dept Med, Wilmington, DE USA
[5] Tech Univ Munich, Dept Med 3, Klinikum Rechts Isar, Munich, Germany
[6] Ludwig Maximilian Univ Munich, Dept Med 2, Univ Hosp, Munich, Germany
[7] Helmholtz Munich, Inst Machine Learning Biomed Imaging, Reliable AI Grp, Munich, Germany
Keywords
AI; BIAS;
DOI
10.1038/s41591-024-03097-1
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Clinical decision-making is one of the most impactful parts of a physician's responsibilities and stands to benefit greatly from artificial intelligence solutions and large language models (LLMs) in particular. However, while LLMs have achieved excellent performance on medical licensing exams, these tests fail to assess many skills necessary for deployment in a realistic clinical decision-making environment, including gathering information, adhering to guidelines, and integrating into clinical workflows. Here we have created a curated dataset based on the Medical Information Mart for Intensive Care database spanning 2,400 real patient cases and four common abdominal pathologies as well as a framework to simulate a realistic clinical setting. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients. Furthermore, we move beyond diagnostic accuracy and demonstrate that they cannot be easily integrated into existing workflows because they often fail to follow instructions and are sensitive to both the quantity and order of information. Overall, our analysis reveals that LLMs are currently not ready for autonomous clinical decision-making while providing a dataset and framework to guide future studies. Using a curated dataset of 2,400 cases and a framework to simulate a realistic clinical setting, current large language models are shown to incur substantial pitfalls when used for autonomous clinical decision-making.
Pages: 2613 - 2622
Number of pages: 26
Related papers
50 records in total
  • [41] Mitigating Cognitive Biases in Clinical Decision-Making Through Multi-Agent Conversations Using Large Language Models: Simulation Study
    Ke, Yuhe
    Yang, Rui
    Lie, Sui An
    Lim, Taylor Xin Yi
    Ning, Yilin
    Li, Irene
    Abdullah, Hairil Rizal
    Ting, Daniel Shu Wei
    Liu, Nan
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [42] Optimized Large Language Models Versus Multiple Sclerosis Specialists: Evaluating Answering Questions of Clinical Decision-Making, A Comparative Study based on clinical scenarios
    Inojosa, Hernan
    Weicken, Eva
    Voigt, Isabel
    Wenk, Judith
    Wiest, Isabella
    Ferber, Dyke
    Gilbert, Stephen
    Kather, Jakob
    Akguen, Katja
    Ziemssen, Tjalf
    MULTIPLE SCLEROSIS JOURNAL, 2024, 30 (03) : 999 - 1000
  • [43] Escalation Risks from Language Models in Military and Diplomatic Decision-Making
    Rivera, Juan-Pablo
    Mukobi, Gabriel
    Reuel, Anka
    Lamparth, Max
    Smith, Chandler
    Schneider, Jacquelyn
    PROCEEDINGS OF THE 2024 ACM CONFERENCE ON FAIRNESS, ACCOUNTABILITY, AND TRANSPARENCY, ACM FACCT 2024, 2024, : 836 - 898
  • [44] PROCESS MODELS OF DECISION-MAKING
    HARTE, JM
    WESTENBERG, MRM
    VANSOMEREN, M
    ACTA PSYCHOLOGICA, 1994, 87 (2-3) : 95 - 120
  • [45] MODELS FOR PARTICIPATION IN DECISION-MAKING
    NIEDENHOFF, HU
    POLITISCHE STUDIEN, 1975, 26 (224) : 575 - 590
  • [46] Soft models in decision-making
    Batyrshin, I
    Pospelov, D
    INTERNATIONAL JOURNAL OF GENERAL SYSTEMS, 2001, 30 (01) : 1 - 2
  • [47] LINGUISTIC DECISION-MAKING MODELS
    DELGADO, M
    VERDEGAY, JL
    VILA, MA
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 1992, 7 (05) : 479 - 492
  • [48] ENERGY MODELS FOR DECISION-MAKING
    ORMEROD, RJ
    EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 1980, 5 (06) : 366 - 377
  • [49] Epilogue: Models of decision-making
    Altman, J
    NEUROBIOLOGY OF DECISION-MAKING, 1996, : 201 - 206
  • [50] Fuzzy Models in Decision-Making
    Gherasim, Ovidiu
    2008 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE FOR MODELLING CONTROL & AUTOMATION, VOLS 1 AND 2, 2008, : 958 - 962