A Comparative Study: Can Large Language Models Beat Radiologists on PI-RADSv2.1-Related Questions?

Cited by: 0
Authors
Çamur, Eren [1]
Cesur, Turay [2]
Güneş, Yasin Celal [3]
Affiliations
[1] Department of Radiology, Ministry of Health Ankara 29 Mayis State Hospital, Aydınlar, Dikmen Cd. No:312, Çankaya, Ankara, 06105, Turkey
[2] Department of Radiology, Ankara Mamak State Hospital, Ankara, Turkey
[3] Department of Radiology, TC Saglik Bakanligi Kirikkale Yuksek Ihtisas Hastanesi, Kırıkkale, Turkey
Abstract
Purpose: This study evaluates the accuracy of various large language models (LLMs) and compares them with radiologists in answering multiple-choice questions (MCQs) related to the Prostate Imaging–Reporting and Data System version 2.1 (PI-RADSv2.1). Methods: In this cross-sectional study, one hundred MCQs covering all sections of PI-RADSv2.1 were prepared and posed to twelve different LLMs: Claude 3 Opus, Claude Sonnet, ChatGPT models (ChatGPT 4o, ChatGPT 4 Turbo, ChatGPT 4, ChatGPT 3.5), Google Gemini models (Gemini 1.5 Pro, Gemini 1.0), Microsoft Copilot, Perplexity, Meta Llama 3 70B, and Mistral Large. Two board-certified (EDiR) radiologists (radiologists 1 and 2) also answered the questions independently. Non-parametric tests were used for statistical analysis because the data were not normally distributed. Results: Claude 3 Opus achieved the highest accuracy rate (85%) among the LLMs, followed by ChatGPT 4 Turbo (82%), ChatGPT 4o (80%), ChatGPT 4 (79%), Gemini 1.5 Pro (79%), and both radiologists (79% each). There was no significant difference in performance among Claude 3 Opus, the ChatGPT 4 models, Gemini 1.5 Pro, and the radiologists (p > 0.05). Conclusion: The fact that Claude 3 Opus outperformed all other LLMs (including the newest, ChatGPT 4o) raises the question of whether it could be a new game changer among LLMs. The high accuracy rates of Claude 3 Opus, the ChatGPT 4 models, and Gemini 1.5 Pro, comparable to those of the radiologists, highlight their potential as clinical decision support tools. This study underscores the potential of LLMs in radiology, suggesting a transformative impact on diagnostic accuracy and efficiency. © Taiwanese Society of Biomedical Engineering 2024
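The abstract reports per-responder accuracy and non-parametric significance testing but does not name the specific test used. Below is a minimal, hypothetical sketch (not the authors' analysis code) of how per-question correctness could be scored and two responders compared with an exact McNemar test on paired binary outcomes; the correctness vectors are synthetic and generated only to roughly mimic the reported accuracy rates.

```python
# Illustrative sketch only: synthetic correctness data, assumed McNemar comparison.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n_questions = 100  # MCQ set covering all sections of PI-RADSv2.1

# Hypothetical 0/1 correctness vectors for two responders over the same 100 MCQs,
# drawn to approximate the reported accuracies (85% for Claude 3 Opus, 79% for a radiologist).
claude_correct = rng.random(n_questions) < 0.85
radiologist_correct = rng.random(n_questions) < 0.79

def accuracy(correct):
    """Fraction of the MCQs answered correctly."""
    return correct.mean()

def mcnemar_exact(a, b):
    """Exact McNemar test on paired binary outcomes: only discordant questions matter."""
    only_a = int(np.sum(a & ~b))   # questions only responder A answered correctly
    only_b = int(np.sum(~a & b))   # questions only responder B answered correctly
    n_discordant = only_a + only_b
    if n_discordant == 0:
        return 1.0
    # Under the null of equal performance, discordant wins split 50/50.
    return binomtest(only_a, n_discordant, 0.5).pvalue

print(f"Claude 3 Opus accuracy: {accuracy(claude_correct):.2%}")
print(f"Radiologist accuracy:   {accuracy(radiologist_correct):.2%}")
print(f"McNemar exact p-value:  {mcnemar_exact(claude_correct, radiologist_correct):.3f}")
```

For a joint comparison of all twelve models and both radiologists on the same 100 questions, a Cochran's Q test over the full binary correctness matrix would be a natural non-parametric choice, though the specific test applied in the study is not stated in the abstract.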
DOI
10.1007/s40846-024-00914-3
Pages: 821-830