Evaluating large language models as agents in the clinic

Cited: 0
Authors
Nikita Mehandru
Brenda Y. Miao
Eduardo Rodriguez Almaraz
Madhumita Sushil
Atul J. Butte
Ahmed Alaa
Affiliations
[1] University of California, Berkeley
[2] Bakar Computational Health Sciences Institute, University of California San Francisco
[3] Neurosurgery Department, Division of Neuro-Oncology, University of California San Francisco
[4] Department of Epidemiology and Biostatistics, University of California San Francisco
[5] Department of Pediatrics, University of California San Francisco
[6] University of California San Francisco
Abstract
Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent “agents” that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model’s ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as “Artificial Intelligence Structured Clinical Examinations” (“AI-SCE”), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial to deploying LLM agents in medical settings.
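To make the AI-SCE proposal concrete, the following is a minimal sketch, not from the paper itself, of what evaluating an LLM agent inside a simulated clinical encounter could look like: a scripted patient reveals findings only when the agent asks about them, the agent drives an open-ended multi-turn dialogue, and scoring targets workflow-level outcomes (information coverage, turns used) rather than standardized-test accuracy. All names here (SimulatedPatient, run_encounter, score_encounter, query_llm) are hypothetical illustrations, not an API from the paper.

```python
# Hypothetical sketch of a simulation-based "AI-SCE"-style evaluation loop.
# The agent is scored on what it elicits over a dialogue, not on one-shot answers.

from dataclasses import dataclass, field

@dataclass
class SimulatedPatient:
    """Scripted stakeholder: reveals a finding only when the agent asks about it."""
    findings: dict[str, str]                 # lowercase cue -> scripted detail
    revealed: set[str] = field(default_factory=set)

    def respond(self, agent_utterance: str) -> str:
        hits = [k for k in self.findings if k in agent_utterance.lower()]
        self.revealed.update(hits)
        if hits:
            return " ".join(self.findings[k] for k in hits)
        return "I'm not sure what you mean, doctor."

def run_encounter(query_llm, patient: SimulatedPatient, max_turns: int = 10) -> list[str]:
    """Roll out one open-ended encounter; query_llm maps the transcript so far
    to the agent's next utterance (e.g., a call to a hosted LLM)."""
    transcript = ["Agent: What brings you in today?"]
    for _ in range(max_turns):
        transcript.append(f"Patient: {patient.respond(transcript[-1])}")
        transcript.append(f"Agent: {query_llm(transcript)}")
    return transcript

def score_encounter(patient: SimulatedPatient, n_turns: int) -> dict[str, float]:
    """Workflow-level metrics: how much of the script was elicited, and at what cost."""
    coverage = len(patient.revealed) / max(len(patient.findings), 1)
    return {"finding_coverage": coverage, "turns_used": float(n_turns)}

# Demo with a trivial stand-in for the LLM agent.
patient = SimulatedPatient(
    findings={"chest pain": "It started an hour ago and radiates to my left arm."}
)
transcript = run_encounter(lambda t: "Tell me more about the chest pain.", patient)
print(score_encounter(patient, len(transcript)))
```

A real AI-SCE would replace the scripted patient with richer multi-stakeholder simulations (clinicians, nurses, EHR systems) and the coverage metric with clinically validated rubrics; the sketch only illustrates the shift from static question answering to interactive, workflow-oriented assessment.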
Related papers
(50 in total; items [21]-[30] shown)
  • [21] A Chinese Dataset for Evaluating the Safeguards in Large Language Models
    Wang, Yuxia
    Zhai, Zenan
    Li, Haonan
    Han, Xudong
    Lin, Lizhi
    Zhang, Zhenxuan
    Zhao, Jingru
    Nakov, Preslav
    Baldwin, Timothy
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 3106 - 3119
  • [22] Evaluating large language models in analysing classroom dialogue
    Long, Yun
    Luo, Haifeng
    Zhang, Yu
    NPJ SCIENCE OF LEARNING, 2024, 9 (01)
  • [23] Evaluating large language models in theory of mind tasks
    Kosinski, Michal
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2024, 121 (45)
  • [24] DebugBench: Evaluating Debugging Capability of Large Language Models
    Tian, Runchu
    Ye, Yining
    Qin, Yujia
    Cong, Xin
    Lin, Yankai
    Pan, Yinxu
    Wu, Yesai
    Hui, Haotian
    Liu, Weichuan
    Liu, Zhiyuan
    Sun, Maosong
PROCEEDINGS OF THE ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2024, : 4173 - 4198
  • [25] Large language models (LLMs) as agents for augmented democracy
    Gudino, Jairo F.
    Grandi, Umberto
    Hidalgo, Cesar
    PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY A-MATHEMATICAL PHYSICAL AND ENGINEERING SCIENCES, 2024, 382 (2285):
  • [26] Conversational Agents for Dementia using Large Language Models
    Favela, Jesus
    Cruz-Sandoval, Dagoberto
    Parra, Mario O.
    2023 MEXICAN INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE, ENC, 2024,
  • [27] A review of large language models and autonomous agents in chemistry
    Ramos, Mayk Caldas
    Collison, Christopher J.
    White, Andrew D.
    CHEMICAL SCIENCE, 2025, 16 (06) : 2514 - 2572
  • [28] Evaluating the Performance of Large Language Models for Spanish Language in Undergraduate Admissions Exams
    Miranda, Sabino
    Pichardo-Lagunas, Obdulia
    Martinez-Seis, Bella
    Baldi, Pierre
    COMPUTACION Y SISTEMAS, 2023, 27 (04): : 1241 - 1248
  • [29] Evaluating and Mitigating Gender Bias in Generative Large Language Models
    Zhou, H.
    Inkpen, D.
    Kantarci, B.
    INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL, 2024, 19 (06)
  • [30] A dataset for evaluating clinical research claims in large language models
    Zhang, Boya
    Bornet, Alban
    Yazdani, Anthony
    Khlebnikov, Philipp
    Milutinovic, Marija
    Rouhizadeh, Hossein
    Amini, Poorya
    Teodoro, Douglas
    SCIENTIFIC DATA, 2025, 12 (01)