GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination

Cited by: 14
Authors
Hirano, Yuichiro [1,5]
Hanaoka, Shouhei [5]
Nakao, Takahiro [2]
Miki, Soichiro [2]
Kikuchi, Tomohiro [2,3]
Nakamura, Yuta [2]
Nomura, Yukihiro [2,4]
Yoshikawa, Takeharu [2]
Abe, Osamu [5]
Affiliations
[1] Int Univ Hlth & Welf, Narita Hosp, Dept Radiol, 852 Hatakeda, Narita, Chiba, Japan
[2] Univ Tokyo Hosp, Dept Computat Diagnost Radiol & Prevent Med, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
[3] Jichi Med Univ, Sch Med, Dept Radiol, 3311-1 Yakushiji, Shimotsuke, Tochigi, Japan
[4] Chiba Univ, Ctr Frontier Med Engn, 1-33 Yayoicho, Inage-ku, Chiba, Japan
[5] Univ Tokyo Hosp, Dept Radiol, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan
Keywords
Artificial intelligence (AI); Large language model (LLM); ChatGPT; GPT-4 Turbo; GPT-4 Turbo with Vision; Japan Diagnostic Radiology Board Examination (JDRBE)
DOI
10.1007/s11604-024-01561-z
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject classification codes
1002; 100207; 1009
Abstract
Purpose: To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI's latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE).
Materials and methods: The dataset comprised questions from JDRBE 2021 and 2023. Six board-certified diagnostic radiologists discussed the questions and provided ground-truth answers, consulting relevant literature as necessary. Questions were excluded if they lacked associated images, if no unanimous agreement on the answer was reached, or if their images were rejected by the OpenAI application programming interface. The inputs for GPT-4TV included both text and images, whereas those for GPT-4T were text only. Both models were deployed on the dataset, and their accuracy was compared using McNemar's exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists, who assigned legitimacy scores on a five-point Likert scale; these scores were compared between models using Wilcoxon's signed-rank test.
Results: The dataset comprised 139 questions. GPT-4TV correctly answered 62 questions (45%), whereas GPT-4T correctly answered 57 questions (41%). No significant difference in accuracy was found between the two models (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores from both radiologists than the GPT-4T responses.
Conclusion: No significant improvement in accuracy was observed when GPT-4TV was given image input compared with text-only GPT-4T on JDRBE questions.
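The comparison described above rests on two paired tests: McNemar's exact test for per-question correctness and Wilcoxon's signed-rank test for the Likert legitimacy scores. Below is a minimal Python sketch of how such an analysis could be set up, using scipy and statsmodels on entirely synthetic placeholder data; it is not the authors' code, and the choice of libraries and variable names is an assumption for illustration only.

```python
# Minimal sketch of the paired statistical comparison described in the abstract.
# Data here are synthetic placeholders, not the study's results.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 139  # number of questions in the study's final dataset

# Hypothetical per-question correctness (1 = correct, 0 = incorrect) for each model.
gpt4tv_correct = rng.integers(0, 2, size=n_questions)
gpt4t_correct = rng.integers(0, 2, size=n_questions)

# 2x2 contingency table of paired outcomes for McNemar's exact test.
table = np.array([
    [np.sum((gpt4tv_correct == 1) & (gpt4t_correct == 1)),
     np.sum((gpt4tv_correct == 1) & (gpt4t_correct == 0))],
    [np.sum((gpt4tv_correct == 0) & (gpt4t_correct == 1)),
     np.sum((gpt4tv_correct == 0) & (gpt4t_correct == 0))],
])
mcnemar_result = mcnemar(table, exact=True)
print(f"McNemar exact test: P = {mcnemar_result.pvalue:.3f}")

# Hypothetical five-point Likert legitimacy scores from one rater for each model's
# responses, compared with Wilcoxon's signed-rank test (paired by question).
scores_gpt4tv = rng.integers(1, 6, size=n_questions)
scores_gpt4t = rng.integers(1, 6, size=n_questions)
stat, p = wilcoxon(scores_gpt4tv, scores_gpt4t)
print(f"Wilcoxon signed-rank test: P = {p:.3f}")
```

In the study itself, each rater's scores would presumably be tested separately, which is consistent with the abstract reporting that both radiologists gave significantly lower scores to the GPT-4TV responses.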
Pages: 918-926 (9 pages)