A Comparative Study: Can Large Language Models Beat Radiologists on PI-RADSv2.1-Related Questions?

Cited by: 0
Authors
Çamur, Eren [1]
Cesur, Turay [2]
Güneş, Yasin Celal [3]
Affiliations
[1] Department of Radiology, Ministry of Health Ankara 29 Mayis State Hospital, Aydınlar, Dikmen, Cd No:312, Ankara, Çankaya,06105, Turkey
[2] Department of Radiology, Ankara Mamak State Hospital, Ankara, Turkey
[3] Department of Radiology, TC Saglik Bakanligi Kirikkale Yuksek Ihtisas Hastanesi, Kırıkkale, Turkey
Abstract
Purpose: This study evaluates the accuracy of various large language models (LLMs) and compares them with radiologists in answering multiple-choice questions (MCQs) related to the Prostate Imaging–Reporting and Data System version 2.1 (PI-RADSv2.1). Methods: In this cross-sectional study, one hundred MCQs covering all sections of PI-RADSv2.1 were prepared and posed to twelve different LLMs: Claude 3 Opus, Claude Sonnet, ChatGPT models (ChatGPT 4o, ChatGPT 4 Turbo, ChatGPT 4, ChatGPT 3.5), Google Gemini models (Gemini 1.5 Pro, Gemini 1.0), Microsoft Copilot, Perplexity, Meta Llama 3 70B, and Mistral Large. Two board-certified (EDiR) radiologists (radiologists 1 and 2) also answered the questions independently. Non-parametric tests were used for statistical analysis due to the non-normal distribution of the data. Results: Claude 3 Opus achieved the highest accuracy rate (85%) among the LLMs, followed by ChatGPT 4 Turbo (82%), ChatGPT 4o (80%), ChatGPT 4 (79%), Gemini 1.5 Pro (79%), and both radiologists (79% each). There was no significant difference in performance among Claude 3 Opus, the ChatGPT 4 models, Gemini 1.5 Pro, and the radiologists (p > 0.05). Conclusion: The fact that Claude 3 Opus outperformed all other LLMs (including the newest, ChatGPT 4o) raises the question of whether it could be a new game changer among LLMs. The high accuracy rates of Claude 3 Opus, the ChatGPT 4 models, and Gemini 1.5 Pro, comparable to those of radiologists, highlight their potential as clinical decision support tools. This study highlights the potential of LLMs in radiology, suggesting a transformative impact on diagnostic accuracy and efficiency. © Taiwanese Society of Biomedical Engineering 2024
DOI: 10.1007/s40846-024-00914-3
Pages: 821 - 830 (9 pages)
Related Papers (50 in total)
  • [31] Can large language models help predict results from a complex behavioural science study?
    Lippert, Steffen
    Dreber, Anna
    Johannesson, Magnus
    Tierney, Warren
    Cyrus-Lai, Wilson
    Uhlmann, Eric Luis
    Pfeiffer, Thomas
    ROYAL SOCIETY OPEN SCIENCE, 2024, 11 (09):
  • [32] How do large language models answer breast cancer quiz questions? A comparative study of GPT-3.5, GPT-4 and Google Gemini
    Irmici, Giovanni
    Cozzi, Andrea
    Della Pepa, Gianmarco
    De Berardinis, Claudia
    D'Ascoli, Elisa
    Cellina, Michaela
Cè, Maurizio
    Depretto, Catherine
    Scaperrotta, Gianfranco
RADIOLOGIA MEDICA, 2024: 1463 - 1467
  • [33] Enhancing Code Security Through Open-Source Large Language Models: A Comparative Study
    Ridley, Norah
    Branca, Enrico
    Kimber, Jadyn
    Stakhanova, Natalia
    FOUNDATIONS AND PRACTICE OF SECURITY, PT I, FPS 2023, 2024, 14551 : 233 - 249
  • [34] Learning Mathematics with Large Language Models: A Comparative Study with Computer Algebra Systems and Other Tools
    Matzakos, Nikolaos
    Doukakis, Spyridon
    Moundridou, Maria
INTERNATIONAL JOURNAL OF EMERGING TECHNOLOGIES IN LEARNING, 2023, 18 (20) : 51 - 71
  • [35] Accuracy of Large Language Models in Answering ESUR Guidelines on Contrast Media-Related Questions: Reply to Gunes et al
    Arachchige, Arosh S. Perera Molligoda
    ACADEMIC RADIOLOGY, 2024, 31 (07) : 3078 - 3078
  • [36] Using large language models for safety-related table summarization in clinical study reports
    Landman, Rogier
    Healey, Sean P.
    Loprinzo, Vittorio
    Kochendoerfer, Ulrike
    Winnier, Angela Russell
    Henstock, Peter, V
    Lin, Wenyi
    Chen, Aqiu
    Rajendran, Arthi
    Penshanwar, Sushant
    Khan, Sheraz
    Madhavan, Subha
    JAMIA OPEN, 2024, 7 (02)
  • [37] Clinical Accuracy of Large Language Models and Google Search Responses to Postpartum Depression Questions: Cross-Sectional Study
    Sezgin, Emre
    Chekeni, Faraaz
    Lee, Jennifer
    Keim, Sarah
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2023, 25
  • [38] Examining How the Large Language Models Impact the Conceptual Design with Human Designers: A Comparative Case Study
    Zhou, Zhibin
    Li, Jinxin
    Zhang, Zhijie
    Yu, Junnan
    Duh, Henry
    INTERNATIONAL JOURNAL OF HUMAN-COMPUTER INTERACTION, 2024,
  • [39] Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study
    Masanneck, Lars
    Schmidt, Linea
    Seifert, Antonia
    Koelsche, Tristan
    Huntemann, Niklas
    Jansen, Robin
    Mehsin, Mohammed
    Bernhard, Michael
    Meuth, Sven G.
    Boehm, Lennert
    Pawlitzki, Marc
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [40] Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study
    Lee, Christine
    Mohebbi, Matthew
O'Callaghan, Erin
    Winsberg, Mirene
    JMIR MENTAL HEALTH, 2024, 11