Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam

Cited: 0
Authors
Builoff, Valerie [1 ]
Shanbhag, Aakash [1 ,2 ]
Miller, Robert J. H. [1 ,3 ]
Dey, Damini [1 ]
Liang, Joanna X. [1 ]
Flood, Kathleen [4 ]
Bourque, Jamieson M. [5 ]
Chareonthaitawee, Panithaya [6 ]
Phillips, Lawrence M. [7 ]
Slomka, Piotr J. [1 ]
Affiliations
[1] Cedars Sinai Med Ctr, Dept Med, Div Artificial Intelligence Med, Imaging & Biomed Sci, Los Angeles, CA 90048 USA
[2] Univ Southern Calif, Signal & Image Proc Inst, Ming Hsieh Dept Elect & Comp Engn, Los Angeles, CA USA
[3] Univ Calgary, Dept Cardiac Sci, Calgary, AB, Canada
[4] Amer Soc Nucl Cardiol, Fairfax, VA USA
[5] Univ Virginia Hlth Syst, Div Cardiovasc Med & Radiol, Charlottesville, VA USA
[6] Mayo Clin, Dept Cardiovasc Med, Rochester, MN USA
[7] NYU Grossman Sch Med, Dept Med, Leon H Charney Div Cardiol, New York, NY USA
Funding
National Institutes of Health (US);
Keywords
Nuclear cardiology board exam; Large language models; GPT; Cardiovascular imaging questions; PERFORMANCE;
DOI
10.1016/j.nuclcard.2024.102089
Chinese Library Classification
R5 [Internal Medicine];
Subject Classification Codes
1002; 100201;
Abstract
Background: Previous studies have evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs, GPT-4, GPT-4 Turbo, GPT-4 omni (GPT-4o) (OpenAI), and Gemini (Google Inc.), in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.
Methods: We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared correct response proportions.
Results: GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4%–58.0%), 40.5% (39.9%–42.9%), 60.7% (59.5%–61.3%), and 63.1% (62.5%–64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (P = .007 vs GPT-4 Turbo; P < .001 vs GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (P < .001, P < .001, and P = .001, respectively), while Gemini performed worse on image-based questions (P < .001 for all comparisons).
Conclusion: GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.
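The McNemar test named in the Methods compares two models answering the same questions by looking only at the discordant pairs (questions one model got right and the other wrong). A minimal sketch, using the standard continuity-corrected chi-square form with hypothetical discordant counts (the record reports only p-values, not the underlying counts):

```python
import math

def mcnemar_test(b: int, c: int) -> tuple[float, float]:
    """McNemar's chi-square test with continuity correction for
    paired binary outcomes on the same question set.
    b = questions model A answered correctly but model B did not;
    c = questions model B answered correctly but model A did not.
    Returns (test statistic, two-sided p-value)."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: sf(x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Hypothetical discordant counts for two models over 168 questions
stat, p = mcnemar_test(b=25, c=10)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")
```

Concordant pairs (both models correct, or both wrong) carry no information about which model is better, which is why only b and c enter the statistic.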
Pages: 11