Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in "Diagnosis Please" cases

Cited by: 16
Authors
Sonoda, Yuki [1]
Kurokawa, Ryo [1]
Nakamura, Yuta [1]
Kanzawa, Jun [1]
Kurokawa, Mariko [1]
Ohizumi, Yuji [1]
Gonoi, Wataru [1]
Abe, Osamu [1]
Affiliations
[1] Univ Tokyo, Grad Sch Med, Dept Radiol, 7-3-1 Hongo,Bunkyo Ku, Tokyo 1138655, Japan
Keywords
Large language model; Artificial intelligence; ChatGPT; GPT-4o; Claude 3 Opus; Gemini 1.5 Pro
DOI
10.1007/s11604-024-01619-y
Chinese Library Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject Classification Codes
1002; 100207; 1009
Abstract
Purpose
Large language models (LLMs) are rapidly advancing and demonstrate high performance in understanding textual information, suggesting potential applications in interpreting patient histories and documented imaging findings. As LLMs continue to improve, their diagnostic abilities are expected to be enhanced further. However, comprehensive comparisons between LLMs from different manufacturers are lacking. In this study, we aimed to test the diagnostic performance of the three latest major LLMs (GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro) using Radiology "Diagnosis Please" cases, a monthly diagnostic quiz series for radiology experts.

Materials and Methods
Clinical histories and imaging findings, provided textually by the case submitters, were extracted from 324 quiz questions originating from Radiology "Diagnosis Please" cases published between 1998 and 2023. The top three differential diagnoses were generated by GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro through their respective application programming interfaces (APIs). Diagnostic performance was compared across the three LLMs using Cochran's Q test and post hoc McNemar's tests.

Results
The diagnostic accuracies of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro for the primary diagnosis were 41.0%, 54.0%, and 33.9%, respectively; these improved to 49.4%, 62.0%, and 41.0% when a correct answer among any of the top three differential diagnoses was counted. Significant differences in diagnostic performance were observed between all pairs of models.

Conclusion
Claude 3 Opus outperformed GPT-4o and Gemini 1.5 Pro in solving radiology quiz cases. These models appear capable of assisting radiologists when supplied with accurately evaluated and clearly worded descriptions of imaging findings.
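As a rough illustration of the querying step described in Materials and Methods, the sketch below calls each vendor's official Python SDK (openai, anthropic, google-generativeai). The prompt wording, model identifier strings, and helper-function names are illustrative assumptions; the study's exact prompts and API parameters are not given in this abstract.

```python
# A minimal sketch of querying the three models, assuming the official SDKs.
# Prompt wording and model version strings are assumptions, not the study's
# exact setup.
import os

import anthropic
import google.generativeai as genai
from openai import OpenAI

PROMPT = ("Based on the following clinical history and imaging findings, "
          "list the three most likely differential diagnoses.\n\n{case}")

def ask_gpt4o(case: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(case=case)}],
    )
    return resp.choices[0].message.content

def ask_claude_3_opus(case: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    resp = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT.format(case=case)}],
    )
    return resp.content[0].text

def ask_gemini_15_pro(case: str) -> str:
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-pro")
    return model.generate_content(PROMPT.format(case=case)).text
```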
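The statistical comparison can be reproduced in outline with statsmodels, which provides both Cochran's Q test and McNemar's test. The sketch below uses simulated per-case binary scores in place of the study's data, and the Bonferroni adjustment is an assumption; the abstract does not state which multiple-comparison correction was used.

```python
# A minimal sketch of the omnibus and post hoc tests, assuming simulated
# per-case scores (1 = correct primary diagnosis, 0 = incorrect).
from itertools import combinations

import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
models = ["GPT-4o", "Claude 3 Opus", "Gemini 1.5 Pro"]
# 324 cases x 3 models; column-wise success probabilities mimic the
# reported primary-diagnosis accuracies.
outcomes = rng.binomial(1, [0.410, 0.540, 0.339], size=(324, 3))

# Omnibus test: do the three models differ in per-case accuracy?
q = cochrans_q(outcomes, return_object=True)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.4f}")

# Post hoc pairwise McNemar tests on the 2x2 concordance tables.
for i, j in combinations(range(3), 2):
    table = np.zeros((2, 2), dtype=int)
    for a, b in outcomes[:, [i, j]]:
        table[a, b] += 1
    p = mcnemar(table, exact=True).pvalue
    print(f"{models[i]} vs {models[j]}: "
          f"p = {min(p * 3, 1.0):.4f} (Bonferroni, assumed)")
```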
Pages: 1231-1235
Page count: 5