Large language models (LLMs) in radiology exams for medical students: Performance and consequences

Times Cited: 0
Authors
Gotta, Jennifer [1 ]
Hong, Quang Anh Le [1 ]
Koch, Vitali [1 ]
Gruenewald, Leon D. [1 ]
Geyer, Tobias [2 ]
Martin, Simon S. [1 ]
Scholtz, Jan-Erik [1 ]
Booz, Christian [1 ]
Dos Santos, Daniel Pinto [1 ]
Mahmoudi, Scherwin [1 ]
Eichler, Katrin [1 ]
Gruber-Rouh, Tatjana [1 ]
Hammerstingl, Renate [1 ]
Biciusca, Teodora [1 ]
Juergens, Lisa Joy [1 ]
Hoehne, Elena [1 ]
Mader, Christoph [1 ]
Vogl, Thomas J. [1 ]
Reschke, Philipp [1 ]
Affiliations
[1] Goethe Univ Frankfurt, Dept Diagnost & Intervent Radiol, Frankfurt, Germany
[2] Rostock Univ, Med Ctr, Inst Diagnost & Intervent Radiol, Pediat Radiol & Neuroradiol, Rostock, Germany
Keywords
AI; medical education
DOI
10.1055/a-2437-2067
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject classification codes
1002; 100207; 1009
Abstract
Purpose: The evolving field of medical education is being shaped by technological advancements, including the integration of Large Language Models (LLMs) like ChatGPT. These models could be invaluable resources for medical students, simplifying complex concepts, enhancing interactive learning, and providing personalized support. LLMs have shown impressive performance on professional examinations even without domain-specific training, making them particularly relevant to the medical field. This study assesses the performance of LLMs on radiology examinations for medical students, shedding light on their current capabilities and implications.

Materials and Methods: This study used 151 multiple-choice questions drawn from radiology exams for medical students. The questions were categorized by type and topic and then processed with OpenAI's GPT-3.5 and GPT-4 via their API, or entered manually into Perplexity AI (based on GPT-3.5) and Bing. LLM performance was evaluated overall, by question type, and by topic.

Results: GPT-3.5 achieved 67.6% overall accuracy on the 151 questions, while GPT-4 significantly outperformed it with 88.1% overall accuracy (p < 0.001). GPT-4 outperformed GPT-3.5, Perplexity AI, and medical students on both lower-order and higher-order questions, excelling particularly on higher-order questions. All GPT models would have passed the radiology exam for medical students at our university.

Conclusion: Our study highlights the potential of LLMs as accessible knowledge resources for medical students. GPT-4 performed well on both lower-order and higher-order questions, making it a potentially very useful tool for reviewing radiology exam questions. Radiologists should be aware of ChatGPT's limitations, including its tendency to confidently provide incorrect responses.
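The Methods describe submitting the exam questions to GPT-3.5 and GPT-4 programmatically through OpenAI's API. Purely as an illustration, the Python sketch below shows one way such a batch evaluation of multiple-choice questions could look; the prompt wording, the answer-letter parsing, and the sample question are assumptions made for this sketch, not the study's actual protocol.

```python
# Minimal sketch: batch-scoring multiple-choice questions with the OpenAI API.
# Prompt format, model names, and parsing are illustrative assumptions,
# not the protocol used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_model(question: str, options: dict[str, str], model: str = "gpt-4") -> str:
    """Send one multiple-choice question and return the letter the model picks."""
    prompt = question + "\n" + "\n".join(f"{k}) {v}" for k, v in options.items())
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer the multiple-choice question with a single letter."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,  # deterministic answers make scoring reproducible
    )
    return response.choices[0].message.content.strip()[0].upper()


# Hypothetical example question, for illustration only.
exam = [
    {"question": "Which modality is first-line for suspected pneumothorax?",
     "options": {"A": "Chest radiograph", "B": "MRI", "C": "PET-CT", "D": "DSA"},
     "correct": "A"},
]

correct = sum(ask_model(q["question"], q["options"]) == q["correct"] for q in exam)
print(f"Accuracy: {correct / len(exam):.1%}")
```

Setting temperature to 0 is a common choice for this kind of benchmarking, since it makes the model's answers, and therefore the measured accuracy, reproducible across runs.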
Pages: 11