Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions

Cited: 2
Authors:
Tarabanis, Constantine [1 ]
Zahid, Sohail [1 ]
Mamalis, Marios [2 ]
Zhang, Kevin [3 ]
Kalampokis, Evangelos [2 ]
Jankelson, Lior [1 ]
Affiliations:
[1] NYU, Sch Med, Leon H Charney Div Cardiol, NYU Langone Hlth, New York, NY 10012 USA
[2] Univ Macedonia, Informat Syst Lab, Thessaloniki, Greece
[3] NYU, Sch Med, Dept Med, NYU Langone Hlth, New York, NY USA
Source: PLOS DIGITAL HEALTH, 2024, Vol. 3, Issue 9
DOI: 10.1371/journal.pdig.0000604
CLC Classification: R19 [Health Care Organization and Services (Health Services Management)]
Abstract
Ongoing research attempts to benchmark large language models (LLMs) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions. Limited data exist on how knowledge supplied to the models, derived from medical texts, improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA, and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program (MKSAP) released by the American College of Physicians, with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Model inputs were augmented with Harrison's Principles of Internal Medicine using the method of Retrieval Augmented Generation. LLM-generated explanations to 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanations to an IM board-certified physician tasked with selecting the human-generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7%, outperforming GPT-3.5, human respondents, LaMDA, and Llama 2, in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject, with its highest and lowest percentile scores in Rheumatology (99.7th) and Infectious Disease (80th), respectively. There was a 3.2-5.3% decrease in the performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its online chatbot. There was a 4.5-7.5% increase in the performance of both GPT-3.5 and GPT-4.0 accessed through their APIs after additional input augmentation. The blinded reviewer correctly identified the human-generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions, outperforming human respondents. Augmenting with domain-specific information improved performance, rendering Retrieval Augmented Generation a possible technique for improving the accuracy of LLM responses on medical examinations.
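The Retrieval Augmented Generation step described above can be sketched as: chunk the reference text, rank chunks by similarity to the exam question, and prepend the top-ranked chunks to the prompt. The sketch below is a minimal, self-contained illustration; it substitutes a toy bag-of-words cosine similarity for the learned embedding model a real pipeline would use, and all function names and the prompt layout are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; production RAG systems use a
    # learned embedding model instead (this is an assumption for illustration).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(v * b[t] for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=2):
    # Rank reference-text chunks by similarity to the question; keep top k.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, chunks, k=2):
    # Prepend the retrieved context to the question before sending it to the LLM.
    context = "\n".join(retrieve(question, chunks, k))
    return f"Context:\n{context}\n\nQuestion:\n{question}"
```

The augmented prompt returned by `build_prompt` would then be passed to the model's API in place of the bare question, which is the input-augmentation condition the abstract compares against.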
Pages: 8
Related Papers (50 total)
  • [31] Performance of large language model artificial intelligence on dermatology board exam questions
    Park, Lily
    Ehlert, Brittany
    Susla, Lyudmyla
    Lum, Zachary C.
    Lee, Patrick K.
    CLINICAL AND EXPERIMENTAL DERMATOLOGY, 2023, 49 (07) : 733 - 734
  • [32] Evolution of publicly available large language models for complex decision-making in breast cancer care
    Griewing, Sebastian
    Knitza, Johannes
    Boekhoff, Jelena
    Hillen, Christoph
    Lechner, Fabian
    Wagner, Uwe
    Wallwiener, Markus
    Kuhn, Sebastian
    ARCHIVES OF GYNECOLOGY AND OBSTETRICS, 2024, 310 (01) : 537 - 550
  • [33] Large language models in internal medicine residency: current use and attitudes among internal medicine residents
    Aaron J. Fried
    Spencer D. Dorn
    William J. Leland
    Emily Mullen
    Donna M. Williams
    Aimee K. Zaas
    Jack MacGuire
    Debra L. Bynum
    Discover Artificial Intelligence, 4 (1):
  • [34] Large language models in medicine
    Thirunavukarasu, Arun James
    Ting, Darren Shu Jeng
    Elangovan, Kabilan
    Gutierrez, Laura
    Tan, Ting Fang
    Ting, Daniel Shu Wei
    NATURE MEDICINE, 2023, 29 (08) : 1930 - 1940
  • [36] Performance of a Large Language Model on Japanese Emergency Medicine Board Certification Examinations
    Igarashi, Yutaka
    Nakahara, Kyoichi
    Norii, Tatsuya
    Miyake, Nodoka
    Tagami, Takashi
    Yokobori, Shoji
    JOURNAL OF NIPPON MEDICAL SCHOOL, 2024, 91 (02) : 155 - 161
  • [37] Accuracy and consistency of publicly available Large Language Models as clinical decision support tools for the management of colon cancer
    Kaiser, Kristen N.
    Hughes, Alexa J.
    Yang, Anthony D.
    Turk, Anita A.
    Mohanty, Sanjay
    Gonzalez, Andrew A.
    Patzer, Rachel E.
    Bilimoria, Karl Y.
    Ellis, Ryan J.
    JOURNAL OF SURGICAL ONCOLOGY, 2024,
  • [38] Large Language Models Take on Cardiothoracic Surgery: A Comparative Analysis of the Performance of Four Models on American Board of Thoracic Surgery Exam Questions in 2023
    Khalpey, Zain
    Kumar, Ujjawal
    King, Nicholas
    Abraham, Alyssa
    Khalpey, Amina H.
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (07)
  • [39] The active board of directors and performance of the large publicly traded corporation
    Millstein, IM
    MacAvoy, PW
    COLUMBIA LAW REVIEW, 1998, 98 (05) : 1283 - 1321
  • [40] Inconsistently Accurate: Repeatability of GPT-3.5 and GPT-4 in Answering Radiology Board-style Multiple Choice Questions
    Ballard, David H.
    RADIOLOGY, 2024, 311 (02)