ChatGPT vs. Gemini: Comparative accuracy and efficiency in Lung-RADS score assignment from radiology reports

Cited by: 0
Authors
Singh, Ria [1]
Hamouda, Mohamed [2]
Chamberlin, Jordan H. [2]
Toth, Adrienn [2]
Munford, James [2]
Silbergleit, Matthew [2]
Baruah, Dhiraj [2]
Burt, Jeremy R. [3]
Kabakus, Ismail M. [2]
Affiliations
[1] Kansas City Univ, Osteopath Med Sch, Kansas City, MO USA
[2] Med Univ South Carolina, Dept Radiol & Radiol Sci, Div Cardiothorac Imaging, Charleston, SC USA
[3] Univ Utah, Dept Radiol & Radiol Sci, Div Cardiothorac Imaging, Sch Med, Salt Lake City, UT USA
Keywords
ChatGPT; Gemini; LLMs; LungRADS; Large language models
DOI
10.1016/j.clinimag.2025.110455
Chinese Library Classification (CLC) number
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline classification codes
1002; 100207; 1009
Abstract
Objective: To evaluate the accuracy of large language models (LLMs) in generating Lung-RADS scores from radiology reports of lung cancer screening low-dose computed tomography (LDCT). Materials and methods: A retrospective cross-sectional analysis was performed on 242 consecutive LDCT radiology reports generated by cardiothoracic fellowship-trained radiologists at a tertiary center. The LLMs evaluated were ChatGPT-3.5, ChatGPT-4o, Google Gemini, and Google Gemini Advanced. Each LLM was used to assign a Lung-RADS score based on the findings section of each report. No domain-specific fine-tuning was applied. Accuracy was determined by comparing the LLM-assigned scores to the radiologist-assigned scores. Efficiency was assessed by measuring the response time of each LLM. Results: ChatGPT-4o achieved the highest accuracy (83.6%) in assigning Lung-RADS scores, with ChatGPT-3.5 reaching 70.1%. Gemini and Gemini Advanced had similar accuracy (70.9% and 65.1%, respectively). ChatGPT-3.5 had the fastest response time (median 4 s), while ChatGPT-4o was slower (median 10 s). Higher Lung-RADS categories were associated with marginally longer completion times. ChatGPT-4o demonstrated the greatest agreement with radiologists (kappa = 0.836), although this was lower than previously reported human interobserver agreement. Conclusion: ChatGPT-4o outperformed ChatGPT-3.5, Gemini, and Gemini Advanced in Lung-RADS score assignment accuracy but did not reach the level of human experts. Despite promising results, further work is needed to integrate domain-specific training and ensure LLM reliability for clinical decision-making in lung cancer screening.
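As an illustration of the two agreement measures named in the abstract, the following is a minimal sketch of how accuracy and a kappa statistic could be computed from paired score lists. It assumes unweighted Cohen's kappa; the abstract does not state which kappa variant was used. The function names and the toy score lists are hypothetical and are not taken from the study.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of reports where the predicted score equals the reference score."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def cohen_kappa(reference, predicted):
    """Unweighted Cohen's kappa (an assumption; the abstract does not name the variant)."""
    n = len(reference)
    p_observed = accuracy(reference, predicted)
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    # Chance agreement: product of the two raters' marginal frequencies per category.
    p_expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # degenerate case: both raters used one identical category
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data, NOT from the study: Lung-RADS categories as strings.
radiologist_scores = ["1", "2", "2", "3", "4A", "4B", "2", "1"]
llm_scores = ["1", "2", "3", "3", "4A", "4A", "2", "1"]

print(f"accuracy = {accuracy(radiologist_scores, llm_scores):.3f}")
print(f"kappa    = {cohen_kappa(radiologist_scores, llm_scores):.3f}")
```

Because Lung-RADS categories are ordinal, a weighted kappa may be a more natural choice in practice; the unweighted form is used here only to keep the sketch short.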
Pages: 6
Related papers
2 items in total
  • [1] ChatGPT vs Gemini: Comparative Accuracy and Efficiency in CAD-RADS Score Assignment from Radiology Reports
    Silbergleit, Matthew
    Toth, Adrienn
    Chamberlin, Jordan H.
    Hamouda, Mohamed
    Baruah, Dhiraj
    Derrick, Sydney
    Schoepf, U. Joseph
    Burt, Jeremy R.
    Kabakus, Ismail M.
    JOURNAL OF IMAGING INFORMATICS IN MEDICINE, 2024,
  • [2] Evaluating the accuracy of lung-RADS score extraction from radiology reports: Manual entry versus natural language processing
    Gandomi, Amir
    Hasan, Eusha
    Chusid, Jesse
    Paul, Subroto
    Inra, Matthew
    Makhnevich, Alex
    Raoof, Suhail
    Silvestri, Gerard
    Bade, Brett C.
    Cohen, Stuart L.
    INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, 2024, 191