A Comparative Study: Can Large Language Models Beat Radiologists on PI-RADSv2.1-Related Questions?

Cited by: 0
Authors
Çamur, Eren [1]
Cesur, Turay [2]
Güneş, Yasin Celal [3]
Affiliations
[1] Department of Radiology, Ministry of Health Ankara 29 Mayis State Hospital, Aydınlar, Dikmen Cd. No:312, 06105 Çankaya, Ankara, Turkey
[2] Department of Radiology, Ankara Mamak State Hospital, Ankara, Turkey
[3] Department of Radiology, TC Saglik Bakanligi Kirikkale Yuksek Ihtisas Hastanesi, Kırıkkale, Turkey
Abstract
Purpose: This study evaluates the accuracy of various large language models (LLMs) and compares them with radiologists in answering multiple-choice questions (MCQs) related to the Prostate Imaging–Reporting and Data System version 2.1 (PI-RADSv2.1). Methods: In this cross-sectional study, one hundred MCQs covering all sections of PI-RADSv2.1 were prepared and posed to twelve different LLMs: Claude 3 Opus, Claude Sonnet, the ChatGPT models (ChatGPT 4o, ChatGPT 4 Turbo, ChatGPT 4, and ChatGPT 3.5), the Google Gemini models (Gemini 1.5 Pro and Gemini 1.0), Microsoft Copilot, Perplexity, Meta Llama 3 70B, and Mistral Large. Two board-certified (EDiR) radiologists (radiologists 1 and 2) answered the same questions independently. Non-parametric tests were used for statistical analysis because the data were not normally distributed. Results: Claude 3 Opus achieved the highest accuracy rate (85%) among the LLMs, followed by ChatGPT 4 Turbo (82%), ChatGPT 4o (80%), ChatGPT 4 (79%), Gemini 1.5 Pro (79%), and both radiologists (79% each). There was no significant difference in performance among Claude 3 Opus, the ChatGPT 4 models, Gemini 1.5 Pro, and the radiologists (p > 0.05). Conclusion: That Claude 3 Opus outperformed all other LLMs, including the newest model, ChatGPT 4o, raises the question of whether it could be a new game changer among LLMs. The high accuracy rates of Claude 3 Opus, the ChatGPT 4 models, and Gemini 1.5 Pro, comparable to those of the radiologists, highlight their potential as clinical decision support tools. This study underscores the promise of LLMs in radiology, suggesting a transformative impact on diagnostic accuracy and efficiency. © Taiwanese Society of Biomedical Engineering 2024
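The abstract reports accuracy comparisons under non-parametric testing but does not name the specific test used. The sketch below shows one plausible way such a paired comparison could be run: per-question correct/incorrect outcomes for two responders on the same 100-question set, compared with McNemar's exact test. The choice of McNemar's test, the simulated scores, and all variable names are assumptions for illustration, not the authors' actual analysis pipeline.

```python
# A minimal sketch of comparing two responders' accuracy on the same MCQ set
# with a paired non-parametric test. The exact test used in the paper is not
# stated in the abstract; McNemar's test is one plausible choice for paired
# binary (correct/incorrect) outcomes. All data below are hypothetical.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 100  # the study used 100 PI-RADSv2.1 MCQs

# Hypothetical per-question correctness (True = correct) for two responders,
# e.g., an LLM at ~85% accuracy vs. a radiologist at ~79% accuracy.
llm_correct = rng.random(n_questions) < 0.85
rad_correct = rng.random(n_questions) < 0.79

# Build the 2x2 paired contingency table: rows = LLM correct/incorrect,
# columns = radiologist correct/incorrect.
table = np.array([
    [np.sum(llm_correct & rad_correct),  np.sum(llm_correct & ~rad_correct)],
    [np.sum(~llm_correct & rad_correct), np.sum(~llm_correct & ~rad_correct)],
])

# exact=True uses the binomial (exact) form of the test, appropriate when
# the discordant-pair counts are small, as on a 100-question set.
result = mcnemar(table, exact=True)
print(f"Accuracy LLM: {llm_correct.mean():.0%}, radiologist: {rad_correct.mean():.0%}")
print(f"McNemar p-value: {result.pvalue:.3f} (p > 0.05 -> no significant difference)")
```

For comparing all twelve LLMs and both radiologists simultaneously on the same question set, Cochran's Q test would be the analogous non-parametric choice before any pairwise follow-up.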
DOI
10.1007/s40846-024-00914-3
Pages: 821-830
Page count: 10
Related Papers
50 items in total
  • [21] Fake News Detection and Classification: A Comparative Study of Convolutional Neural Networks, Large Language Models, and Natural Language Processing Models
    Roumeliotis, Konstantinos I.
    Tselikas, Nikolaos D.
    Nasiopoulos, Dimitrios K.
    FUTURE INTERNET, 2025, 17 (01)
  • [22] Can Large Language Models Truly Understand Prompts? A Case Study with Negated Prompts
    Jang, Joel
    Ye, Seonghyeon
    Seo, Minjoon
    TRANSFER LEARNING FOR NATURAL LANGUAGE PROCESSING WORKSHOP, 2022, 203 : 52 - 62
  • [23] Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study
    Iannantuono, Giovanni Maria
    Bracken-Clarke, Dara
    Karzai, Fatima
    Choo-Wosoba, Hyoyoung
    Gulley, James L.
    Floudas, Charalampos S.
    ONCOLOGIST, 2024, : 407 - 414
  • [24] Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education
    Henkel, Owen
    Hills, Libby
    Boxer, Adam
    Roberts, Bill
    Levonian, Zach
    PROCEEDINGS OF THE ELEVENTH ACM CONFERENCE ON LEARNING@SCALE, L@S 2024, 2024, : 300 - 304
  • [25] A Comparative Study of Chatbot Response Generation: Traditional Approaches Versus Large Language Models
    McTear, Michael
    Marokkie, Sheen Varghese
    Bi, Yaxin
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT II, KSEM 2023, 2023, 14118 : 70 - 79
  • [26] Prompting or Fine-tuning? A Comparative Study of Large Language Models for Taxonomy Construction
    Chen, Boqi
    Yi, Fandi
    Varro, Daniel
    2023 ACM/IEEE INTERNATIONAL CONFERENCE ON MODEL DRIVEN ENGINEERING LANGUAGES AND SYSTEMS COMPANION, MODELS-C, 2023, : 588 - 596
  • [27] Decoding Logic Errors: A Comparative Study on Bug Detection by Students and Large Language Models
    MacNeil, Stephen
    Denny, Paul
    Tran, Andrew
    Leinonen, Juho
    Bernstein, Seth
    Hellas, Arto
    Sarsa, Sami
    Kim, Joanne
    PROCEEDINGS OF THE 26TH AUSTRALASIAN COMPUTING EDUCATION CONFERENCE, ACE 2024, 2024, : 11 - 18
  • [28] Artificial Intelligence in Academic Translation: A Comparative Study of Large Language Models and Google Translate
    Mohsen, Mohammed Ali
    PSYCHOLINGUISTICS, 2024, 35 (02) : 134 - 156
  • [29] Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study
    Wilhelm, Theresa Isabelle
    Roos, Jonas
    Kaczmarczyk, Robert
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2023, 25
  • [30] Large Language Models Take on Cardiothoracic Surgery: A Comparative Analysis of the Performance of Four Models on American Board of Thoracic Surgery Exam Questions in 2023
    Khalpey, Zain
    Kumar, Ujjawal
    King, Nicholas
    Abraham, Alyssa
    Khalpey, Amina H.
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (07)