Analyzing Large Language Models' Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard

Cited by: 6
Authors
Lang, Siegmund Philipp [1,2]
Yoseph, Ezra Tilahun [1]
Gonzalez-Suarez, Aneysis D. [1]
Kim, Robert [1]
Fatemi, Parastou [3]
Wagner, Katherine [4]
Maldaner, Nicolai [1,5,6]
Stienen, Martin N. [7,8,9]
Zygourakis, Corinna Clio [1]
Affiliations
[1] Stanford Univ, Dept Neurosurg, Sch Med, Stanford, CA USA
[2] Univ Hosp Regensburg, Dept Trauma Surg, Regensburg, Germany
[3] Cleveland Clin, Dept Neurosurg, Cleveland, OH USA
[4] Ventura Neurosurg, Ventura, CA USA
[5] Univ Hosp Zurich, Dept Neurosurg, Zurich, Switzerland
[6] Univ Zurich, Clin Neurosci Ctr, Zurich, Switzerland
[7] Cantonal Hosp St Gallen, Dept Neurosurg, St Gallen, Switzerland
[8] Cantonal Hosp, Spine Ctr Eastern Switzerland, St Gallen, Switzerland
[9] Med Sch St Gallen, St Gallen, Switzerland
Keywords
Artificial intelligence; Large language models; Patient education; Lumbar spine fusion; ChatGPT; Bard; Complication
DOI
10.14245/ns.2448098.049
Chinese Library Classification
R74 [Neurology and Psychiatry]
Abstract
Objective: In the digital age, patients turn to online sources for information on lumbar spine fusion, making careful study of large language models (LLMs) such as chat generative pre-trained transformer (ChatGPT) necessary for patient education.

Methods: We assessed the quality of responses from OpenAI's ChatGPT 3.5 and Google's Bard to patient questions on lumbar spine fusion surgery. From 158 frequently asked questions identified via Google search, we selected 10 critical questions and presented them to both chatbots. Five blinded spine surgeons rated each response on a 4-point scale from 'unsatisfactory' to 'excellent'; the clarity and professionalism of the answers were also evaluated on a 5-point Likert scale.

Results: Across the 10 questions, 97% of the responses from ChatGPT 3.5 and Bard were rated excellent or satisfactory. Specifically, ChatGPT gave 62% excellent and 32% minimally clarifying responses, with only 6% needing moderate or substantial clarification; Bard gave 66% excellent and 24% minimally clarifying responses, with 10% requiring more clarification. The overall rating distributions of the 2 models did not differ significantly. Both struggled with 3 questions concerning surgical risks, success rates, and selection of surgical approach (Q3, Q4, and Q5). Interrater reliability was low for both models (ChatGPT: κ = 0.041, p = 0.622; Bard: κ = -0.040, p = 0.601). Both scored well on understanding and empathy, although Bard received marginally lower ratings in empathy and professionalism.

Conclusion: ChatGPT 3.5 and Bard effectively answered frequently asked questions on lumbar spine fusion, but further training and research are needed to solidify the role of LLMs in medical education and healthcare communication.
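The abstract reports interrater reliability as a kappa statistic with an accompanying p-value, but the record does not state which kappa variant the authors computed. As a minimal sketch, assuming a Fleiss-style kappa across the five blinded raters, agreement could be estimated as follows; the ratings below are hypothetical placeholders, not the study's data, and the statsmodels package is assumed to be available.

# Minimal sketch: Fleiss' kappa for 5 raters scoring 10 chatbot responses.
# Assumption: a Fleiss-style kappa; the ratings are hypothetical, NOT study data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = the 10 questions, columns = the 5 blinded spine surgeons.
# The 4-point scale is encoded 0 = unsatisfactory ... 3 = excellent.
rng = np.random.default_rng(0)
ratings = rng.integers(low=0, high=4, size=(10, 5))  # hypothetical ratings

# aggregate_raters turns subject-by-rater codes into the
# subject-by-category count table that fleiss_kappa expects.
table, _categories = aggregate_raters(ratings)

kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")

A kappa near zero, as reported for both chatbots, indicates agreement no better than chance; the p-values quoted in the abstract would come from a separate significance test that this sketch does not reproduce.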
Pages: 633-641
Page count: 9