Toward expert-level medical question answering with large language models

Cited by: 0
Authors
Karan Singhal [1 ]
Tao Tu [1 ]
Juraj Gottweis [1 ]
Rory Sayres [1 ]
Ellery Wulczyn [1 ]
Mohamed Amin [1 ]
Le Hou [1 ]
Kevin Clark [2 ]
Stephen R. Pfohl [1 ]
Heather Cole-Lewis [1 ]
Darlene Neal [1 ]
Qazi Mamunur Rashid [1 ]
Mike Schaekermann [1 ]
Amy Wang [1 ]
Dev Dash [3 ]
Jonathan H. Chen [4 ]
Nigam H. Shah [5 ]
Sami Lachgar [6 ]
Philip Andrew Mansfield [7 ]
Sushant Prakash [8 ]
Bradley Green [1 ]
Ewa Dominowska [1 ]
Blaise Agüera y Arcas [1 ]
Nenad Tomašev [1 ]
Yun Liu [2 ]
Renee Wong [1 ]
Christopher Semturs [2 ]
S. Sara Mahdavi [1 ]
Joelle K. Barral [1 ]
Dale R. Webster [1 ]
Greg S. Corrado [2 ]
Yossi Matias [2 ]
Shekoofeh Azizi [1 ]
Alan Karthikesalingam [1 ]
Vivek Natarajan [1 ]
Affiliations
[1] Google Research
[2] Google DeepMind
[3] Department of Emergency Medicine, Stanford University School of Medicine
[4] Stanford Center for Biomedical Informatics Research, Stanford University
[5] Division of Hospital Medicine, Stanford University
[6] Clinical Excellence Research Center, Stanford University
[7] Department of Medicine, Stanford University School of Medicine
[8] Technology and Digital Solutions, Stanford Healthcare
DOI
10.1038/s41591-024-03423-7
Abstract
Large language models (LLMs) have shown promise in medical question answering, with Med-PaLM being the first to exceed a ‘passing’ score on United States Medical Licensing Examination-style questions. However, challenges remain in long-form medical question answering and handling real-world workflows. Here, we present Med-PaLM 2, which bridges these gaps with a combination of base LLM improvements, medical domain fine-tuning and new strategies for improving reasoning and grounding through ensemble refinement and chain of retrieval. Med-PaLM 2 scores up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19%, and demonstrates dramatic performance increases across the MedMCQA, PubMedQA and MMLU clinical topics datasets. Our detailed human evaluation framework shows that physicians prefer Med-PaLM 2 answers to those from other physicians on eight of nine clinical axes. Med-PaLM 2 also demonstrates significant improvements over its predecessor across all evaluation metrics, particularly on new adversarial datasets designed to probe LLM limitations (P < 0.001). In a pilot study using real-world medical questions, specialists preferred Med-PaLM 2 answers to generalist physician answers 65% of the time. While specialist answers were still preferred overall, both specialists and generalists rated Med-PaLM 2 to be as safe as physician answers, demonstrating its growing potential in real-world medical applications.
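The ensemble refinement strategy named in the abstract is, at a high level, a two-stage sample-and-aggregate scheme: first sample several chain-of-thought drafts, then condition the model on its own drafts to produce a refined answer. Below is a minimal Python sketch under stated assumptions: generate(prompt, temperature) stands in for any LLM completion call, and the prompt wording and sample counts are illustrative placeholders, not the paper's exact settings.

import collections
from typing import Callable, List

def ensemble_refinement(
    question: str,
    generate: Callable[[str, float], str],  # hypothetical (prompt, temperature) -> completion
    n_drafts: int = 8,
    n_refinements: int = 16,
) -> str:
    # Stage 1: sample diverse chain-of-thought drafts at nonzero temperature.
    drafts: List[str] = [
        generate(f"Question: {question}\nExplain step by step, then answer.", 0.7)
        for _ in range(n_drafts)
    ]
    # Stage 2: condition on the question plus all drafts and ask the model to
    # reconcile them; repeat the refinement and take a plurality vote.
    joined = "\n---\n".join(drafts)
    refine_prompt = (
        f"Question: {question}\n"
        f"Candidate reasoned answers:\n{joined}\n"
        "Considering the candidates above, give the single best final answer."
    )
    refined = [generate(refine_prompt, 0.7) for _ in range(n_refinements)]
    return collections.Counter(refined).most_common(1)[0][0]

The intuition behind this two-stage design is that the second pass lets the model read and reconcile its own diverse reasoning paths, rather than relying on a simple vote over the first-stage answers alone.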
Pages: 943-950
Number of pages: 7