Comparing Diagnostic Accuracy of Radiologists versus GPT-4V and Gemini Pro Vision Using Image Inputs from Diagnosis Please Cases

Cited by: 12
Authors
Suh, Pae Sun [1 ,2 ,3 ,4 ,5 ]
Shim, Woo Hyun [1 ,2 ,6 ]
Suh, Chong Hyun [1 ,2 ]
Heo, Hwon [6 ]
Park, Chae Ri [6 ]
Eom, Hye Joung [1 ,2 ]
Park, Kye Jin [1 ,2 ]
Choe, Jooae [1 ,2 ,7 ]
Kim, Pyeong Hwa [1 ,2 ]
Park, Hyo Jung [1 ,2 ]
Ahn, Yura [1 ,2 ]
Park, Ho Young [1 ,2 ]
Choi, Yoonseok
Woo, Chang-Yun [8 ]
Park, Hyungjun [9 ]
Affiliations
[1] Univ Ulsan, Coll Med, Asan Med Ctr, Dept Radiol & Res, Olymp Ro 33, Seoul 05505, South Korea
[2] Univ Ulsan, Coll Med, Asan Med Ctr, Res Inst Radiol, Olymp-ro 33, Seoul 05505, South Korea
[3] Yonsei Univ, Coll Med, Dept Radiol, Seoul, South Korea
[4] Yonsei Univ, Res Inst Radiol Sci, Coll Med, Seoul, South Korea
[5] Yonsei Univ, Coll Med, Ctr Clin Imaging Data Sci, Seoul, South Korea
[6] Univ Ulsan, Asan Med Ctr, Asan Med Inst Convergence Sci & Technol, Coll Med,Dept Med Sci, Seoul, South Korea
[7] Univ Ulsan, Coll Med, Gangneung Asan Hosp, Med Res Inst, Kangnung, South Korea
[8] Univ Ulsan, Coll Med, Asan Med Ctr, Dept Internal Med, Seoul, South Korea
[9] Gumdan Top Hosp, Dept Pulm & Crit Care Med, Incheon, South Korea
Funding
National Research Foundation of Singapore
Keywords
DOI
10.1148/radiol.240273
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject Classification Codes
1002; 100207; 1009
Abstract
Background: The diagnostic abilities of multimodal large language models (LLMs) using direct image inputs and the impact of the temperature parameter of LLMs remain unexplored. Purpose: To investigate the ability of GPT-4V and Gemini Pro Vision to generate differential diagnoses at different temperatures, compared with radiologists, using Radiology Diagnosis Please cases. Materials and Methods: This retrospective study included Diagnosis Please cases published from January 2008 to October 2023. Input images included original images and captures of the textual patient history and figure legends (without imaging findings) from PDF files of each case. The LLMs were tasked with providing three differential diagnoses, repeated five times at temperatures 0, 0.5, and 1. Eight subspecialty-trained radiologists solved the cases. An experienced radiologist compared generated and final diagnoses, considering the result correct if the generated diagnoses included the final diagnosis after five repetitions. Accuracy was assessed across models, temperatures, and radiology subspecialties, with statistical significance set at P < .007 after Bonferroni correction for multiple comparisons across the LLMs at the three temperatures and with radiologists. Results: A total of 190 cases were included in neuroradiology (n = 53), multisystem (n = 27), gastrointestinal (n = 25), genitourinary (n = 23), musculoskeletal (n = 17), chest (n = 16), cardiovascular (n = 12), pediatric (n = 12), and breast (n = 5) subspecialties. Overall accuracy improved with increasing temperature settings (0, 0.5, 1) for both GPT-4V (41% [78 of 190 cases], 45% [86 of 190 cases], and 49% [93 of 190 cases], respectively) and Gemini Pro Vision (29% [55 of 190 cases], 36% [69 of 190 cases], and 39% [74 of 190 cases], respectively), although there was no evidence of a statistically significant difference after Bonferroni adjustment (GPT-4V, P = .12; Gemini Pro Vision, P = .04).
The overall accuracy of radiologists (61% [115 of 190 cases]) was higher than that of Gemini Pro Vision at temperature 1 (T1) (P < .001), while no statistically significant difference was observed between radiologists and GPT-4V at T1 after Bonferroni adjustment (P = .02). Radiologists (range, 45%-88%) outperformed the LLMs at T1 (range, 24%-75%) in most subspecialties. Conclusion: Using direct radiologic image inputs, GPT-4V and Gemini Pro Vision showed improved diagnostic accuracy with increasing temperature settings. Although GPT-4V slightly underperformed compared with radiologists, it nonetheless demonstrated promising potential as a supportive tool in diagnostic decision-making. © RSNA, 2024
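The scoring protocol in the Materials and Methods (three differentials per query, five repetitions per temperature, a case counted correct if any repetition names the final diagnosis) can be sketched as below. This is a minimal illustration, not the authors' code: `ask_llm` is a hypothetical stand-in for a multimodal LLM call, and the seven-comparison Bonferroni divisor is an assumption chosen because 0.05 / 7 rounds to the reported P < .007 threshold.

```python
def evaluate_cases(cases, ask_llm, temperature, n_repeats=5):
    """Fraction of cases solved, per the abstract's protocol.

    ask_llm(case, temperature) -> list of 3 differential diagnoses
    (hypothetical signature). A case is correct if the final diagnosis
    appears in any of the n_repeats repetitions.
    """
    correct = 0
    for case in cases:
        generated = set()
        for _ in range(n_repeats):
            generated.update(ask_llm(case, temperature))
        if case["final_diagnosis"] in generated:
            correct += 1
    return correct / len(cases)


def bonferroni_threshold(alpha=0.05, n_comparisons=7):
    # Assumed 7 comparisons: 0.05 / 7 ~= 0.0071, matching P < .007.
    return alpha / n_comparisons
```

Pooling the five repetitions into one set before checking the final diagnosis mirrors the abstract's rule that a result counts as correct if the generated diagnoses include the final diagnosis after five repetitions.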
Pages: 10