Performance of Publicly Available Large Language Models on Internal Medicine Board-style Questions

Cited: 2
Authors
Tarabanis, Constantine [1 ]
Zahid, Sohail [1 ]
Mamalis, Marios [2 ]
Zhang, Kevin [3 ]
Kalampokis, Evangelos [2 ]
Jankelson, Lior [1 ]
Affiliations
[1] NYU, Sch Med, Leon H Charney Div Cardiol, NYU Langone Hlth, New York, NY 10012 USA
[2] Univ Macedonia, Informat Syst Lab, Thessaloniki, Greece
[3] NYU, Sch Med, Dept Med, NYU Langone Hlth, New York, NY USA
Source
PLOS DIGITAL HEALTH | 2024, Vol. 3, No. 9
Keywords
DOI
10.1371/journal.pdig.0000604
Chinese Library Classification
R19 [Health care organization and services (health services management)];
Discipline Classification Code
Abstract
Ongoing research attempts to benchmark large language models (LLMs) against physicians' fund of knowledge by assessing LLM performance on medical examinations. No prior study has assessed LLM performance on internal medicine (IM) board examination questions, and limited data exist on how knowledge supplied to the models, derived from medical texts, improves LLM performance. The performance of GPT-3.5, GPT-4.0, LaMDA, and Llama 2, with and without additional model input augmentation, was assessed on 240 randomly selected IM board-style questions. Questions were sourced from the Medical Knowledge Self-Assessment Program (MKSAP) released by the American College of Physicians, with each question serving as part of the LLM prompt. When available, LLMs were accessed both through their application programming interface (API) and their corresponding chatbot. Model inputs were augmented with Harrison's Principles of Internal Medicine using the method of Retrieval Augmented Generation (RAG). LLM-generated explanations for 25 correctly answered questions were presented in a blinded fashion alongside the MKSAP explanations to an IM board-certified physician tasked with selecting the human-generated response. GPT-4.0, accessed either through Bing Chat or its API, scored 77.5-80.7%, outperforming GPT-3.5, human respondents, LaMDA, and Llama 2, in that order. GPT-4.0 outperformed human MKSAP users on every tested IM subject, with its lowest and highest percentile scores in Infectious Disease (80th) and Rheumatology (99.7th), respectively. Performance of both GPT-3.5 and GPT-4.0 decreased by 3.2-5.3% when the models were accessed through their APIs instead of their online chatbots, and increased by 4.5-7.5% when the API-accessed models received additional input augmentation. The blinded reviewer correctly identified the human-generated MKSAP response in 72% of the 25-question sample set. GPT-4.0 performed best on IM board-style questions, outperforming human respondents. Augmenting prompts with domain-specific information improved performance, rendering Retrieval Augmented Generation a possible technique for improving the accuracy of LLM responses to medical examination questions.
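Since the abstract describes augmenting model inputs with passages from a medical reference text via Retrieval Augmented Generation, a minimal sketch of that pattern follows. It is illustrative only, not the authors' implementation: TF-IDF retrieval stands in for whatever retrieval method the study actually used, the reference chunks are invented placeholders, and the assembled prompt would then be sent to an LLM API (not shown).

```python
# Minimal RAG sketch: retrieve reference passages relevant to an exam
# question and prepend them to the prompt before calling an LLM.
# TF-IDF retrieval is an assumption standing in for the paper's method;
# the chunk texts below are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical chunks of a medical reference text.
chunks = [
    "Rheumatoid arthritis is a chronic inflammatory polyarthritis ...",
    "First-line therapy for community-acquired pneumonia includes ...",
    "Diabetic ketoacidosis is managed with IV fluids and insulin ...",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k reference chunks most similar to the question."""
    vectorizer = TfidfVectorizer().fit(chunks + [question])
    chunk_vecs = vectorizer.transform(chunks)
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, chunk_vecs)[0]
    top = scores.argsort()[::-1][:k]  # indices of the k best-scoring chunks
    return [chunks[i] for i in top]

def build_prompt(question: str) -> str:
    """Prepend the retrieved context to the exam question."""
    context = "\n".join(retrieve(question))
    return (
        "Use the following reference material to answer the question.\n\n"
        f"Reference:\n{context}\n\nQuestion:\n{question}\nAnswer:"
    )

if __name__ == "__main__":
    q = "A 58-year-old woman presents with symmetric joint swelling ..."
    print(build_prompt(q))  # this augmented prompt would go to the LLM API
```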
Pages: 8