Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam

Cited: 0
Authors
Builoff, Valerie [1 ]
Shanbhag, Aakash [1 ,2 ]
Miller, Robert J. H. [1 ,3 ]
Dey, Damini [1 ]
Liang, Joanna X. [1 ]
Flood, Kathleen [4 ]
Bourque, Jamieson M. [5 ]
Chareonthaitawee, Panithaya [6 ]
Phillips, Lawrence M. [7 ]
Slomka, Piotr J. [1 ]
Affiliations
[1] Cedars Sinai Med Ctr, Dept Med, Div Artificial Intelligence Med, Imaging & Biomed Sci, Los Angeles, CA 90048 USA
[2] Univ Southern Calif, Signal & Image Proc Inst, Ming Hsieh Dept Elect & Comp Engn, Los Angeles, CA USA
[3] Univ Calgary, Dept Cardiac Sci, Calgary, AB, Canada
[4] Amer Soc Nucl Cardiol, Fairfax, VA USA
[5] Univ Virginia Hlth Syst, Div Cardiovasc Med & Radiol, Charlottesville, VA USA
[6] Mayo Clin, Dept Cardiovasc Med, Rochester, MN USA
[7] NYU Grossman Sch Med, Dept Med, Leon H Charney Div Cardiol, New York, NY USA
Funding
National Institutes of Health (US);
Keywords
Nuclear cardiology board exam; Large language models; GPT; Cardiovascular imaging questions; PERFORMANCE;
DOI
10.1016/j.nuclcard.2024.102089
Chinese Library Classification
R5 [Internal Medicine];
Subject Classification Codes
1002; 100201;
Abstract
Background: Previous studies have evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs, GPT-4, GPT-4 Turbo, GPT-4 omni (GPT-4o) (OpenAI), and Gemini (Google Inc.), in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.
Methods: We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared correct response proportions.
Results: GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4%–58.0%), 40.5% (39.9%–42.9%), 60.7% (59.5%–61.3%), and 63.1% (62.5%–64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (P = .007 vs GPT-4 Turbo; P < .001 vs GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (P < .001, P < .001, and P = .001, respectively), while Gemini performed worse on image-based questions (P < .001 for all comparisons).
Conclusion: GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.
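The McNemar test named in the Methods compares two models answering the same questions by looking only at the discordant pairs (questions one model got right and the other wrong). A minimal sketch, using the standard continuity-corrected chi-square form with hypothetical discordant counts (the record reports only p-values, not the underlying counts):

```python
import math

def mcnemar_test(b: int, c: int) -> tuple[float, float]:
    """McNemar's chi-square test with continuity correction for
    paired binary outcomes on the same question set.
    b = questions model A answered correctly but model B did not;
    c = questions model B answered correctly but model A did not.
    Returns (test statistic, two-sided p-value)."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: sf(x) = erfc(sqrt(x/2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Hypothetical discordant counts for two models over 168 questions
stat, p = mcnemar_test(b=25, c=10)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")
```

Concordant pairs (both models correct, or both wrong) carry no information about which model is better, which is why only b and c enter the statistic.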
Pages: 11