Toward expert-level medical question answering with large language models

Cited by: 0

Authors
Karan Singhal [1 ]
Tao Tu [1 ]
Juraj Gottweis [1 ]
Rory Sayres [1 ]
Ellery Wulczyn [1 ]
Mohamed Amin [1 ]
Le Hou [1 ]
Kevin Clark [2 ]
Stephen R. Pfohl [1 ]
Heather Cole-Lewis [1 ]
Darlene Neal [1 ]
Qazi Mamunur Rashid [1 ]
Mike Schaekermann [1 ]
Amy Wang [1 ]
Dev Dash [3 ]
Jonathan H. Chen [4 ]
Nigam H. Shah [5 ]
Sami Lachgar [6 ]
Philip Andrew Mansfield [7 ]
Sushant Prakash [8 ]
Bradley Green [1 ]
Ewa Dominowska [1 ]
Blaise Agüera y Arcas [1 ]
Nenad Tomašev [1 ]
Yun Liu [2 ]
Renee Wong [1 ]
Christopher Semturs [2 ]
S. Sara Mahdavi [1 ]
Joelle K. Barral [1 ]
Dale R. Webster [1 ]
Greg S. Corrado [2 ]
Yossi Matias [2 ]
Shekoofeh Azizi [1 ]
Alan Karthikesalingam [1 ]
Vivek Natarajan [1 ]
Affiliations
[1] Google Research
[2] Google DeepMind
[3] Department of Emergency Medicine, Stanford University School of Medicine
[4] Stanford Center for Biomedical Informatics Research, Stanford University
[5] Division of Hospital Medicine, Stanford University
[6] Clinical Excellence Research Center, Stanford University
[7] Department of Medicine, Stanford University School of Medicine
[8] Technology and Digital Solutions, Stanford Healthcare
DOI
10.1038/s41591-024-03423-7
Abstract
Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a ‘passing’ score on United States Medical Licensing Examination-style questions. However, challenges remain in long-form medical question answering and handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates dramatic performance increases across MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluation framework shows that physicians prefer Med-PaLM 2 answers to those from other physicians on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. While specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 to be as safe as physician answers, demonstrating its growing potential in real-world medical applications.
Pages: 943–950
Page count: 7
Related papers (50 in total)
  • [1] Enabling GPTs for Expert-Level Environmental Engineering Question Answering
    Zhu, Jun-Jie
    Yang, Meiqi
    Jiang, Jinyue
    Bai, Yiming
    Chen, Danqi
    Ren, Zhiyong Jason
    Environmental Science and Technology Letters, 2024, 11 (12): : 1327 - 1333
  • [2] Reasoning with large language models for medical question answering
    Lucas, Mary M.
    Yang, Justin
    Pomeroy, Jon K.
    Yang, Christopher C.
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2024, 31 (09)
  • [3] MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering
    Alonso, Inigo
    Oronoz, Maite
    Agerri, Rodrigo
    ARTIFICIAL INTELLIGENCE IN MEDICINE, 2024, 155
  • [4] A medical question answering system using large language models and knowledge graphs
    Guo, Quan
    Cao, Shuai
    Yi, Zhang
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2022, 37 (11) : 8548 - 8564
  • [5] Enhancing Biomedical Question Answering with Large Language Models
    Yang, Hua
    Li, Shilong
    Goncalves, Teresa
    INFORMATION, 2024, 15 (08)
  • [6] An astronomical question answering dataset for evaluating large language models
    Li, Jie
    Zhao, Fuyong
    Chen, Panfeng
    Xie, Jiafu
    Zhang, Xiangrui
    Li, Hui
    Chen, Mei
    Wang, Yanhao
    Zhu, Ming
    Scientific Data, 12 (1)
  • [7] A General Approach to Website Question Answering with Large Language Models
    Ding, Yilang
    Nie, Jiawei
    Wu, Di
    Liu, Chang
    SOUTHEASTCON 2024, 2024, : 894 - 896
  • [8] Tree-of-Reasoning Question Decomposition for Complex Question Answering with Large Language Models
    Zhang, Kun
    Zeng, Jiali
    Meng, Fandong
    Wang, Yuanzhuo
    Sun, Shiqi
    Bai, Long
    Shen, Huawei
    Zhou, Jie
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024, : 19560 - 19568
  • [9] Chart Question Answering based on Modality Conversion and Large Language Models
    Liu, Yi-Cheng
    Chu, Wei-Ta
    PROCEEDINGS OF THE FIRST ACM WORKSHOP ON AI-POWERED QUESTION ANSWERING SYSTEMS FOR MULTIMEDIA, AIQAM 2024, 2024, : 19 - 24
  • [10] Large language models in medical ethics: useful but not expert
    Ferrario, Andrea
    Biller-Andorno, Nikola
    JOURNAL OF MEDICAL ETHICS, 2024,