Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions

Cited by: 2
Authors
Tarabanis, Constantine [1 ]
Zahid, Sohail [1 ]
Mamalis, Marios [2 ]
Zhang, Kevin [3 ]
Kalampokis, Evangelos [2 ]
Jankelson, Lior [1 ]
Affiliations
[1] NYU, Sch Med, Leon H Charney Div Cardiol, NYU Langone Hlth, New York, NY 10012 USA
[2] Univ Macedonia, Informat Syst Lab, Thessaloniki, Greece
[3] NYU, Sch Med, Dept Med, NYU Langone Hlth, New York, NY USA
Source
PLOS DIGITAL HEALTH, 2024, 3 (09)
DOI
10.1371/journal.pdig.0000604
Chinese Library Classification
R19 [Health Care Organization and Services (Health Services Administration)]
Abstract
Ongoing research attempts to benchmark large language models (LLMs) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions, and limited data exist on whether supplying the models with knowledge derived from medical texts improves their performance. The performance of GPT-3.5, GPT-4.0, LaMDA, and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program (MKSAP) released by the American College of Physicians, with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Model inputs were augmented with Harrison's Principles of Internal Medicine using Retrieval Augmented Generation (RAG). LLM-generated explanations for 25 correctly answered questions were presented in a blinded fashion, alongside the MKSAP explanation, to an IM board-certified physician tasked with selecting the human-generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7%, outperforming GPT-3.5, human respondents, LaMDA, and Llama 2, in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject, with its lowest and highest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There was a 3.2-5.3% decrease in the performance of both GPT-3.5 and GPT-4.0 when each model was accessed through its API instead of its online chatbot, and a 4.5-7.5% increase in the performance of both models, accessed through their APIs, after additional input augmentation. The blinded reviewer correctly identified the human-generated MKSAP response for 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions, outperforming human respondents. Augmenting model inputs with domain-specific information improved performance, rendering Retrieval Augmented Generation a possible technique for improving the accuracy of LLM responses to medical examination questions.
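A minimal sketch of the two techniques named in the abstract, API access and Retrieval Augmented Generation, is shown below. The abstract does not disclose the authors' implementation, so everything here is an assumption: the fixed-size chunking, the keyword-overlap retriever (a stand-in for the embedding-based retrieval a production RAG pipeline would use), and the helper names chunk_text, retrieve, and answer_board_question are all hypothetical.

```python
# Hypothetical sketch of prompting an LLM over its API, with and without
# retrieval augmentation; not the authors' published pipeline.
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

client = OpenAI()


def chunk_text(text: str, size: int = 1000) -> list[str]:
    """Split reference material (e.g., a textbook chapter) into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Toy retriever: rank chunks by word overlap with the question.
    A real RAG pipeline would embed the chunks and query a vector index."""
    q_words = set(question.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )[:k]


def answer_board_question(question: str, reference_text: str | None = None) -> str:
    """Pose a board-style question; optionally prepend retrieved passages."""
    prompt = question
    if reference_text:
        context = "\n\n".join(retrieve(question, chunk_text(reference_text)))
        prompt = f"Reference material:\n{context}\n\nQuestion:\n{question}"
    response = client.chat.completions.create(
        model="gpt-4",  # the study compared GPT-3.5, GPT-4.0, LaMDA, and Llama 2
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Scoring answer_board_question(q) against answer_board_question(q, textbook_text) over the same question set would reproduce the shape of the augmentation comparison reported in the abstract.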
Pages: 8
Related Papers (50 total)
  • [41] Comparative performance of artificial intelligence models in physical medicine and rehabilitation board-level questions
    Menek, Ahmet Kivanc
    Is, Enes Efe
    REVISTA DA ASSOCIACAO MEDICA BRASILEIRA, 2024, 70 (07)
  • [42] Performance of Large Language Models ChatGPT and Gemini on Workplace Management Questions in Radiology
    Leutz-Schmidt, Patricia
    Palm, Viktoria
    Mathy, Rene Michael
    Groezinger, Martin
    Kauczor, Hans-Ulrich
    Jang, Hyungseok
    Sedaghat, Sam
    DIAGNOSTICS, 2025, 15 (04)
  • [43] Comparison of Performance of Large Language Models on Lung-RADS Related Questions
    Camur, Eren
    Cesur, Turay
    Gunes, Yasin Celal
    JCO GLOBAL ONCOLOGY, 2024, 10
  • [44] Performance of large language models on benign prostatic hyperplasia frequently asked questions
    Zhang, YuNing
    Dong, Yijie
    Mei, Zihan
    Hou, Yiqing
    Wei, Minyan
    Yeung, Yat Hin
    Xu, Jiale
    Hua, Qing
    Lai, LiMei
    Li, Ning
    Xia, ShuJun
    Zhou, Chun
    Zhou, JianQiao
    PROSTATE, 2024, 84 (09): 807-813
  • [45] Diagnostic Accuracy of Large Language Models in the European Board of Interventional Radiology Examination (EBIR) Sample Questions
    Gunes, Yasin Celal
    Cesur, Turay
    CARDIOVASCULAR AND INTERVENTIONAL RADIOLOGY, 2024, 47 (06): 836-837
  • [46] Exploring the Potential and Limitations of Chat Generative Pre-trained Transformer (ChatGPT) in Generating Board-Style Dermatology Questions: A Qualitative Analysis
    Ayub, Ibraheim
    Hamann, Dathan
    Hamann, Carsten R.
    Davis, Matthew J.
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (08)
  • [47] Large language models for medicine: a survey
    Zheng, Yanxin
    Gan, Wensheng
    Chen, Zefeng
    Qi, Zhenlian
    Liang, Qian
    Yu, Philip S.
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2025, 16 (02): 1015-1040
  • [48] Large language models for science and medicine
    Telenti, Amalio
    Auli, Michael
    Hie, Brian L.
    Maher, Cyrus
    Saria, Suchi
    Ioannidis, John P. A.
    EUROPEAN JOURNAL OF CLINICAL INVESTIGATION, 2024, 54 (06)
  • [49] Military internal medicine resident performance on the American Board of Internal Medicine certifying examination
    Cation, LJ
    Durning, SJ
    Gutierrez-Nunez, JJ
    MILITARY MEDICINE, 2002, 167 (05): 421-423
  • [50] Associations between Doximity internal medicine residency navigator reputation rank and publicly available metrics
    Stephenson, Christopher R.
    Mandrekar, Jayawant N.
    Beckman, Thomas J.
    Wittich, Christopher M.
    BMC MEDICAL EDUCATION, 2025, 25 (01)