A Comparative Study: Can Large Language Models Beat Radiologists on PI-RADSv2.1-Related Questions?

Cited: 0
Authors
Çamur, Eren [1]
Cesur, Turay [2]
Güneş, Yasin Celal [3]
Affiliations
[1] Department of Radiology, Ministry of Health Ankara 29 Mayis State Hospital, Aydınlar, Dikmen Cd No:312, Çankaya, 06105 Ankara, Turkey
[2] Department of Radiology, Ankara Mamak State Hospital, Ankara, Turkey
[3] Department of Radiology, TC Saglik Bakanligi Kirikkale Yuksek Ihtisas Hastanesi, Kırıkkale, Turkey
Abstract
Purpose: This study evaluates the accuracy of various large language models (LLMs) and compares them with radiologists in answering multiple-choice questions (MCQs) related to the Prostate Imaging–Reporting and Data System version 2.1 (PI-RADSv2.1). Methods: In this cross-sectional study, one hundred MCQs covering all sections of PI-RADSv2.1 were prepared and posed to twelve different LLMs: Claude 3 Opus, Claude Sonnet, the ChatGPT models (ChatGPT 4o, ChatGPT 4 Turbo, ChatGPT 4, ChatGPT 3.5), the Google Gemini models (Gemini 1.5 Pro, Gemini 1.0), Microsoft Copilot, Perplexity, Meta Llama 3 70B, and Mistral Large. Two board-certified (EDiR) radiologists (radiologists 1 and 2) also answered the questions independently. Non-parametric tests were used for statistical analysis because the data were not normally distributed. Results: Claude 3 Opus achieved the highest accuracy rate (85%) among the LLMs, followed by ChatGPT 4 Turbo (82%), ChatGPT 4o (80%), ChatGPT 4 (79%), Gemini 1.5 Pro (79%), and both radiologists (79% each). There was no significant difference in performance among Claude 3 Opus, the ChatGPT 4 models, Gemini 1.5 Pro, and the radiologists (p > 0.05). Conclusion: That Claude 3 Opus outperformed all other LLMs (including the newest model, ChatGPT 4o) raises the question of whether it could be a new game changer among LLMs. The high accuracy rates of Claude 3 Opus, the ChatGPT 4 models, and Gemini 1.5 Pro, comparable to those of the radiologists, highlight their potential as clinical decision support tools and suggest a transformative impact on diagnostic accuracy and efficiency in radiology. © Taiwanese Society of Biomedical Engineering 2024
DOI
10.1007/s40846-024-00914-3
Pages: 821-830
Page count: 9
Related Papers
50 entries in total
  • [1] Can large language models reason about medical questions?
    Lievin, Valentin
    Hother, Christoffer Egeberg
    Motzfeldt, Andreas Geert
    Winther, Ole
    PATTERNS, 2024, 5 (03):
  • [2] Comparative Evaluation of the Accuracies of Large Language Models in Answering VI-RADS-Related Questions
    Camur, Eren
    Cesur, Turay
    Gunes, Yasin Celal
    KOREAN JOURNAL OF RADIOLOGY, 2024, 25 (08) : 767 - 768
  • [3] Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study
    Workum, Jessica D.
    Volkers, Bas W. S.
    van de Sande, Davy
    Arora, Sumesh
    Goeijenbier, Marco
    Gommers, Diederik
    van Genderen, Michel E.
    CRITICAL CARE, 29 (1):
  • [4] Comparison of Performance of Large Language Models on Lung-RADS Related Questions
    Camur, Eren
    Cesur, Turay
    Gunes, Yasin Celal
    JCO GLOBAL ONCOLOGY, 2024, 10
  • [5] A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Expert-Edited Large Language Models Alone
    Tailor, Prashant D.
    Dalvin, Lauren A.
    Chen, John J.
    Iezzi, Raymond
    Olsen, Timothy W.
    Scruggs, Brittni A.
    Barkmeier, Andrew J.
    Bakri, Sophie J.
    Ryan, Edwin H.
    Tang, Peter H.
    Parke, D. Wilkin, III
    Belin, Peter J.
    Sridhar, Jayanth
    Xu, David
    Kuriyan, Ajay E.
    Yonekawa, Yoshihiro
    Starr, Matthew R.
    OPHTHALMOLOGY SCIENCE, 2024, 4 (04):
  • [6] Analyzing the Efficacy of Large Language Models: A Comparative Study
    Khetarpaul, Sonia
    Sharma, Dolly
    Sinha, Shreya
    Nagpal, Aryan
    Narang, Aarush
    DATABASE AND EXPERT SYSTEMS APPLICATIONS, PT I, DEXA 2024, 2024, 14910 : 215 - 221
  • [7] Can large language models safely address patient questions following cataract surgery?
    Lim, Ernest Junwei
    Chowdhury, Mohita
    Higham, Aisling
    McKinnon, Rory
    Ventoura, Nikoletta
    He, Yajie Vera
    de Pennington, Nick
    INVESTIGATIVE OPHTHALMOLOGY & VISUAL SCIENCE, 2023, 64 (08)
  • [8] Study Tests Large Language Models' Ability to Answer Clinical Questions
    Harris, Emily
    JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION, 2023, 330 (06): : 496 - 496
  • [9] Comparative Readability Assessment of Four Large Language Models in Answers to Common Contraception Questions
    Patel, Anisha V.
    Panakam, Aisvarya
    Amin, Kanhai
    Doshi, Rushabh H.
    Patil, Ankita
    Sheth, Sangini S.
    OBSTETRICS AND GYNECOLOGY, 2024, 143 (5S): : 13S - 13S
  • [10] Accuracy of Large Language Models in ACR Manual on Contrast Media-Related Questions
    Gunes, Yasin Celal
    Cesur, Turay
    ACADEMIC RADIOLOGY, 2024, 31 (07)