Toward expert-level medical question answering with large language models

Cited: 0
Authors
Karan Singhal [1 ]
Tao Tu [1 ]
Juraj Gottweis [1 ]
Rory Sayres [1 ]
Ellery Wulczyn [1 ]
Mohamed Amin [1 ]
Le Hou [1 ]
Kevin Clark [2 ]
Stephen R. Pfohl [1 ]
Heather Cole-Lewis [1 ]
Darlene Neal [1 ]
Qazi Mamunur Rashid [1 ]
Mike Schaekermann [1 ]
Amy Wang [1 ]
Dev Dash [3 ]
Jonathan H. Chen [4 ]
Nigam H. Shah [5 ]
Sami Lachgar [6 ]
Philip Andrew Mansfield [7 ]
Sushant Prakash [8 ]
Bradley Green [1 ]
Ewa Dominowska [1 ]
Blaise Agüera y Arcas [1 ]
Nenad Tomašev [1 ]
Yun Liu [2 ]
Renee Wong [1 ]
Christopher Semturs [2 ]
S. Sara Mahdavi [1 ]
Joelle K. Barral [1 ]
Dale R. Webster [1 ]
Greg S. Corrado [2 ]
Yossi Matias [2 ]
Shekoofeh Azizi [1 ]
Alan Karthikesalingam [1 ]
Vivek Natarajan [1 ]
Affiliations
[1] Google Research
[2] Google DeepMind
[3] Department of Emergency Medicine, Stanford University School of Medicine
[4] Stanford Center for Biomedical Informatics Research, Stanford University
[5] Division of Hospital Medicine, Stanford University
[6] Clinical Excellence Research Center, Stanford University
[7] Department of Medicine, Stanford University School of Medicine
[8] Technology and Digital Solutions, Stanford Healthcare
DOI: 10.1038/s41591-024-03423-7
Abstract
Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a ‘passing’ score on United States Medical Licensing Examination-style questions. However, challenges remain in long-form medical question answering and handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates dramatic performance increases across the MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluation framework shows that physicians prefer Med-PaLM 2 answers to those from other physicians on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. While specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 to be as safe as physician answers, demonstrating its growing potential in real-world medical applications.
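The ensemble refinement strategy named in the abstract can be read as a two-stage prompting scheme: sample several diverse chain-of-thought drafts at nonzero temperature, then condition a second, low-temperature generation on the question plus all drafts so the model can aggregate them. A minimal sketch under that reading follows; `generate` and `toy_model` are hypothetical stand-ins, not the paper's actual API:

```python
import random
from collections import Counter

def ensemble_refinement(question, generate, k=5, seed=0):
    """Two-stage ensemble refinement sketch.

    `generate(prompt, temperature, seed)` is a hypothetical LLM call.
    Stage 1 samples k diverse chain-of-thought drafts; stage 2 asks the
    model to aggregate those drafts into one refined answer.
    """
    rng = random.Random(seed)
    # Stage 1: k stochastic samples (temperature > 0) for diversity.
    drafts = [generate(question, temperature=0.7, seed=rng.random())
              for _ in range(k)]
    # Stage 2: condition a deterministic pass on all drafts at once.
    prompt = question + "\nCandidate answers:\n" + "\n".join(drafts)
    return generate(prompt, temperature=0.0, seed=0.0)

def toy_model(prompt, temperature, seed):
    """Stand-in model: stochastic single-letter guesses at stage 1,
    majority vote over the listed candidates at stage 2."""
    if "Candidate answers:" in prompt:
        candidates = prompt.split("Candidate answers:\n")[1].splitlines()
        return Counter(candidates).most_common(1)[0][0]
    return random.Random(seed).choice(["A", "B", "B", "C"])
```

With `toy_model`, the refinement stage reduces to a majority vote over the sampled drafts; in the real setting the second pass can also rewrite and reconcile the drafts rather than merely vote.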
Pages: 943-950 (7 pages)
Related papers (50 total)
  • [21] A question-answering framework for automated abstract screening using large language models
    Akinseloyin, Opeoluwa
    Jiang, Xiaorui
    Palade, Vasile
    Journal of the American Medical Informatics Association, 2024, 31 (09)
  • [22] Review of Research Progress on Question-Answering Techniques Based on Large Language Models
    Wen, Sen
    Qian, Li
    Hu, Maodi
    Chang, Zhijun
    Data Analysis and Knowledge Discovery, 2024, 8 (06): 16-29
  • [23] Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts
    Lan, Yunshi
    Li, Xiang
    Liu, Xin
    Li, Yang
    Qin, Wei
    Qian, Weining
    Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, 2023: 4389-4400
  • [24] Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models
    van Sonsbeek, Tom
    Derakhshani, Mohammad Mahdi
    Najdenkoska, Ivona
    Snoek, Cees G. M.
    Worring, Marcel
    Medical Image Computing and Computer Assisted Intervention, MICCAI 2023, Pt V, 2023, 14224: 726-736
  • [25] Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering
    Shao, Zhenwei
    Yu, Zhou
    Wang, Meng
    Yu, Jun
    2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 14974-14983
  • [26] ZVQAF: Zero-shot visual question answering with feedback from large language models
    Liu, Cheng
    Wang, Chao
    Peng, Yan
    Li, Zhixu
    Neurocomputing, 2024, 580
  • [27] NurViD: A Large Expert-Level Video Database for Nursing Procedure Activity Understanding
    Hu, Ming
    Wang, Lin
    Yan, Siyuan
    Ma, Don
    Ren, Qingli
    Xia, Peng
    Feng, Wei
    Duan, Peibo
    Ju, Lie
    Ge, Zongyuan
    Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023
  • [28] Efficient Question Answering Based on Language Models and Knowledge Graphs
    Li, Fengying
    Huang, Hongfei
    Dong, Rongsheng
    Artificial Neural Networks and Machine Learning, ICANN 2023, Pt IV, 2023, 14257: 340-351
  • [29] Language processing and learning models for community question answering in Arabic
    Romeo, Salvatore
    Da San Martino, Giovanni
    Belinkov, Yonatan
    Barron-Cedeno, Alberto
    Eldesouki, Mohamed
    Darwish, Kareem
    Mubarak, Hamdy
    Glass, James
    Moschitti, Alessandro
    Information Processing & Management, 2019, 56 (02): 274-290
  • [30] Unveiling the power of language models in chemical research question answering
    Xiuying Chen
    Tairan Wang
    Taicheng Guo
    Kehan Guo
    Juexiao Zhou
    Haoyang Li
    Zirui Song
    Xin Gao
    Xiangliang Zhang
    Communications Chemistry, 8 (1)