Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions

Cited: 2
Authors
Tarabanis, Constantine [1 ]
Zahid, Sohail [1 ]
Mamalis, Marios [2 ]
Zhang, Kevin [3 ]
Kalampokis, Evangelos [2 ]
Jankelson, Lior [1 ]
Affiliations
[1] NYU, Sch Med, Leon H Charney Div Cardiol, NYU Langone Hlth, New York, NY 10012 USA
[2] Univ Macedonia, Informat Syst Lab, Thessaloniki, Greece
[3] NYU, Sch Med, Dept Med, NYU Langone Hlth, New York, NY USA
Source
PLOS DIGITAL HEALTH | 2024, Vol. 3, Issue 09
Keywords
DOI
10.1371/journal.pdig.0000604
Chinese Library Classification
R19 [Health Care Organization and Services (Health Administration)]
Discipline Classification Code
Abstract
Ongoing research attempts to benchmark large language models (LLMs) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions. Limited data exist on how knowledge supplied to the models, derived from medical texts, improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA, and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program (MKSAP) released by the American College of Physicians, with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Model inputs were augmented with Harrison's Principles of Internal Medicine using the method of Retrieval-Augmented Generation. LLM-generated explanations for 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanations to an IM board-certified physician tasked with selecting the human-generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7%, outperforming GPT-3.5, human respondents, LaMDA, and Llama 2, in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject, with its lowest and highest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. There was a 3.2-5.3% decrease in the performance of both GPT-3.5 and GPT-4.0 when accessing the LLM through its API instead of its online chatbot, and a 4.5-7.5% increase in the performance of both GPT-3.5 and GPT-4.0 accessed through their APIs after additional input augmentation. The blinded reviewer correctly identified the human-generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions, outperforming human respondents. Augmenting model inputs with domain-specific information improved performance, rendering Retrieval-Augmented Generation a possible technique for improving the accuracy of LLM responses to medical examination questions.
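The abstract describes augmenting model inputs via Retrieval-Augmented Generation: passages retrieved from a reference text (here, Harrison's Principles of Internal Medicine) are prepended to the board-style question before it is sent to the model. The sketch below is only an illustration of that general pattern, not the authors' implementation; the chunk size, the term-frequency retriever, and all function names are assumptions made for the example, and real RAG pipelines typically use dense embeddings rather than word counts.

```python
# Minimal sketch of RAG-style prompt augmentation (illustrative assumptions only).
import math
import re
from collections import Counter


def chunk_text(text: str, words_per_chunk: int = 200) -> list[str]:
    """Split a long reference text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]


def _tf_vector(text: str) -> Counter:
    """Lower-cased term-frequency vector over word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k reference chunks most similar to the question."""
    q_vec = _tf_vector(question)
    ranked = sorted(chunks,
                    key=lambda c: cosine_similarity(q_vec, _tf_vector(c)),
                    reverse=True)
    return ranked[:top_k]


def build_augmented_prompt(question: str, options: str,
                           context_chunks: list[str]) -> str:
    """Prepend retrieved reference passages to the board-style question."""
    context = "\n\n".join(context_chunks)
    return ("Use the following reference material to answer the question.\n\n"
            f"Reference material:\n{context}\n\n"
            f"Question:\n{question}\n\nAnswer choices:\n{options}\n"
            "Select the single best answer and explain your reasoning.")


if __name__ == "__main__":
    reference_text = "..."            # full text of the reference source (placeholder)
    question = "A 54-year-old man presents with ..."  # hypothetical board-style stem
    options = "A. ...\nB. ...\nC. ...\nD. ..."
    chunks = chunk_text(reference_text)
    prompt = build_augmented_prompt(question, options, retrieve(question, chunks))
    # The augmented prompt, rather than the bare question, would then be sent
    # to the model's API; this is the input-augmentation step described above.
    print(prompt[:500])
```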
Pages: 8
Related Articles (50 total)
  • [21] Abbas, Ali; Rehman, Mahad S.; Rehman, Syed S. Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions. CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (03)
  • [22] Bhayana, Rajesh; Krishna, Satheesh; Bleakney, Robert R. Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations. RADIOLOGY, 2023, 307 (05)
  • [23] Langenbach, Marcel C.; Foldyna, Borek; Hadzic, Ibrahim; Langenbach, Isabel L.; Raghu, Vineet K.; Lu, Michael T.; Neilan, Tomas G.; Heemelaar, Julius C. Automated anonymization of radiology reports: comparison of publicly available natural language processing and large language models. EUROPEAN RADIOLOGY, 2024
  • [24] Elias, Marcus L.; Burshtein, Joshua; Sharon, Victoria R. OpenAI's GPT-4 performs to a high degree on board-style dermatology questions. INTERNATIONAL JOURNAL OF DERMATOLOGY, 2024, 63 (01): 73-78
  • [25] Hopkins, Benjamin S.; Nguyen, Vincent N.; Dallas, Jonathan; Texakalidis, Pavlos; Yang, Max; Renn, Alex; Guerra, Gage; Kashif, Zain; Cheok, Stephanie; Zada, Gabriel; Mack, William J. ChatGPT versus the neurosurgical written boards: a comparative analysis of artificial intelligence/machine learning performance on neurosurgical board-style questions. JOURNAL OF NEUROSURGERY, 2023, 139 (03): 904-911
  • [26] Beam, Kristyn; Sharma, Puneet; Kumar, Bhawesh; Wang, Cindy; Brodsky, Dara; Martin, Camilia R.; Beam, Andrew. Performance of a Large Language Model on Practice Questions for the Neonatal Board Examination. JAMA PEDIATRICS, 2023, 177 (09): 977-979
  • [27] Ji, Shaoxiong; Zhang, Tianlin; Ansari, Luna; Fu, Jie; Tiwari, Prayag; Cambria, Erik. MentalBERT: Publicly Available Pretrained Language Models for Mental Healthcare. LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022: 7184-7190
  • [28] Mirza, Fatima N.; Lim, Rachel K.; Yumeen, Sara; Wahood, Samer; Zaidat, Bashar; Shah, Asghar; Tang, Oliver Y.; Kawaoka, John; Seo, Su-Jean; Dimarco, Christopher; Muglia, Jennie; Goldbach, Hayley S.; Wisco, Oliver; Qureshi, Abrar A.; Libby, Tiffany J. Performance of Three Large Language Models on Dermatology Board Examinations. JOURNAL OF INVESTIGATIVE DERMATOLOGY, 2024, 144 (02): 398-400
  • [29] Longwell, Jack B.; Hirsch, Ian; Binder, Fernando; Conchas, Galileo Arturo Gonzalez; Mau, Daniel; Jang, Raymond; Krishnan, Rahul G.; Grant, Robert C. Performance of Large Language Models on Medical Oncology Examination Questions. JAMA NETWORK OPEN, 2024, 7 (06): e2417641
  • [30] Ghayur, Muhammad N.; Ghayur, Ayesha. Assigning Commercial Off-The-Shelf (COTS)-Based Board-Style Pharmacology Practice Questions Medical Students. JOURNAL OF PHARMACOLOGY AND EXPERIMENTAL THERAPEUTICS, 2025, 392 (03)