Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in "Diagnosis Please" cases

Cited by: 16
Authors
Sonoda, Yuki [1]
Kurokawa, Ryo [1]
Nakamura, Yuta [1]
Kanzawa, Jun [1]
Kurokawa, Mariko [1]
Ohizumi, Yuji [1]
Gonoi, Wataru [1]
Abe, Osamu [1]
Affiliations
[1] Univ Tokyo, Grad Sch Med, Dept Radiol, 7-3-1 Hongo,Bunkyo Ku, Tokyo 1138655, Japan
Keywords
Large language model; Artificial intelligence; ChatGPT; GPT-4o; Claude 3 Opus; Gemini 1.5 Pro
DOI
10.1007/s11604-024-01619-y
Chinese Library Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject Classification Codes
1002; 100207; 1009
Abstract
Purpose
Large language models (LLMs) are rapidly advancing and demonstrate high performance in understanding textual information, suggesting potential applications in interpreting patient histories and documented imaging findings. As LLMs continue to improve, their diagnostic abilities are expected to be enhanced further. However, comprehensive comparisons between LLMs from different manufacturers are lacking. In this study, we aimed to test the diagnostic performance of the three latest major LLMs (GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro) using Radiology "Diagnosis Please" cases, a monthly diagnostic quiz series for radiology experts.

Materials and Methods
Clinical histories and imaging findings, provided textually by the case submitters, were extracted from 324 quiz questions originating from Radiology "Diagnosis Please" cases published between 1998 and 2023. The top three differential diagnoses were generated by GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro through their respective application programming interfaces (APIs). Diagnostic performance was compared across the three LLMs using Cochran's Q test and post hoc McNemar's tests.

Results
The diagnostic accuracies of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro for the primary diagnosis were 41.0%, 54.0%, and 33.9%, respectively; these improved to 49.4%, 62.0%, and 41.0% when a correct answer among any of the top three differential diagnoses was counted. Significant differences in diagnostic performance were observed between all pairs of models.

Conclusion
Claude 3 Opus outperformed GPT-4o and Gemini 1.5 Pro in solving radiology quiz cases. These models appear capable of assisting radiologists when supplied with accurately evaluated and clearly worded descriptions of imaging findings.
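As a rough illustration of the querying step described in Materials and Methods, the sketch below calls each vendor's official Python SDK (openai, anthropic, google-generativeai). The prompt wording, model identifier strings, and helper-function names are illustrative assumptions; the study's exact prompts and API parameters are not given in this abstract.

```python
# A minimal sketch of querying the three models, assuming the official SDKs.
# Prompt wording and model version strings are assumptions, not the study's
# exact setup.
import os

import anthropic
import google.generativeai as genai
from openai import OpenAI

PROMPT = ("Based on the following clinical history and imaging findings, "
          "list the three most likely differential diagnoses.\n\n{case}")

def ask_gpt4o(case: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(case=case)}],
    )
    return resp.choices[0].message.content

def ask_claude_3_opus(case: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    resp = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT.format(case=case)}],
    )
    return resp.content[0].text

def ask_gemini_15_pro(case: str) -> str:
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-pro")
    return model.generate_content(PROMPT.format(case=case)).text
```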
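The statistical comparison can be reproduced in outline with statsmodels, which provides both Cochran's Q test and McNemar's test. The sketch below uses simulated per-case binary scores in place of the study's data, and the Bonferroni adjustment is an assumption; the abstract does not state which multiple-comparison correction was used.

```python
# A minimal sketch of the omnibus and post hoc tests, assuming simulated
# per-case scores (1 = correct primary diagnosis, 0 = incorrect).
from itertools import combinations

import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
models = ["GPT-4o", "Claude 3 Opus", "Gemini 1.5 Pro"]
# 324 cases x 3 models; column-wise success probabilities mimic the
# reported primary-diagnosis accuracies.
outcomes = rng.binomial(1, [0.410, 0.540, 0.339], size=(324, 3))

# Omnibus test: do the three models differ in per-case accuracy?
q = cochrans_q(outcomes, return_object=True)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.4f}")

# Post hoc pairwise McNemar tests on the 2x2 concordance tables.
for i, j in combinations(range(3), 2):
    table = np.zeros((2, 2), dtype=int)
    for a, b in outcomes[:, [i, j]]:
        table[a, b] += 1
    p = mcnemar(table, exact=True).pvalue
    print(f"{models[i]} vs {models[j]}: "
          f"p = {min(p * 3, 1.0):.4f} (Bonferroni, assumed)")
```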
Pages: 1231-1235
Page count: 5