ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case-Based Questions

Cited by: 14
|
Authors
Buhr, Christoph Raphael [1 ,2 ,6 ]
Smith, Harry [3 ]
Huppertz, Tilman [1 ]
Bahr-Hamm, Katharina [1 ]
Matthias, Christoph [1 ]
Blaikie, Andrew [2 ]
Kelsey, Tom [3 ]
Kuhn, Sebastian [4 ,5 ]
Eckrich, Jonas [1 ]
Affiliations
[1] Johannes Gutenberg Univ Mainz, Univ Med Ctr, Dept Otorhinolaryngol, Mainz, Germany
[2] Univ St Andrews, Sch Med, St Andrews, Scotland
[3] Univ St Andrews, Sch Comp Sci, St Andrews, Scotland
[4] Philipps Univ Marburg, Inst Digital Med, Marburg, Germany
[5] Univ Hosp Giessen & Marburg, Marburg, Germany
[6] Johannes Gutenberg Univ Mainz, Univ Med Ctr, Dept Otorhinolaryngol, Langenbeckstr 1, D-55131 Mainz, Germany
Source
JMIR MEDICAL EDUCATION | 2023 / Vol. 9
Keywords
large language models; LLMs; LLM; artificial intelligence; AI; ChatGPT; otorhinolaryngology; ORL; digital health; chatbots; global health; low- and middle-income countries; telemedicine; telehealth; language model; chatbot; online health information; nonverbal communication; seeking; anxiety; Google
DOI
10.2196/49183
Chinese Library Classification (CLC)
G40 [Education];
Subject classification codes
040101; 120403
Abstract
Background: Large language models (LLMs), such as ChatGPT (OpenAI), are increasingly used in medicine and supplement standard search engines as information sources, leading to more "consultations" of LLMs about personal medical symptoms.
Objective: This study aims to evaluate ChatGPT's performance in answering clinical case-based questions in otorhinolaryngology (ORL) in comparison to ORL consultants' answers.
Methods: We used 41 case-based questions from established ORL study books and past German state examinations for doctors. The questions were answered by both ORL consultants and ChatGPT 3. ORL consultants rated all responses, except their own, on medical adequacy, conciseness, coherence, and comprehensibility using a 6-point Likert scale. They also identified, in a blinded setting, whether the answer was created by an ORL consultant or by ChatGPT. Additionally, the character counts of the answers were compared. Given the rapidly evolving pace of the technology, a comparison between responses generated by ChatGPT 3 and ChatGPT 4 was also performed.
Results: Ratings in all categories were significantly higher for ORL consultants (P<.001). Although inferior to the scores of the ORL consultants, ChatGPT's scores were relatively higher in the semantic categories (conciseness, coherence, and comprehensibility) than in medical adequacy. ORL consultants correctly identified ChatGPT as the source in 98.4% (121/123) of cases. ChatGPT's answers had a significantly higher character count than those of the ORL consultants (P<.001). The comparison between responses generated by ChatGPT 3 and ChatGPT 4 showed a slight improvement in medical accuracy as well as better coherence of the answers provided. In contrast, neither conciseness (P=.06) nor comprehensibility (P=.08) improved significantly, despite a significant 52.5% increase in the mean number of characters ((1470-964)/964; P<.001).
Conclusions: While ChatGPT provided longer answers to medical problems, medical adequacy and conciseness were significantly lower compared with ORL consultants' answers. LLMs have potential as augmentative tools for medical care, but their "consultation" for medical problems carries a high risk of misinformation, as their high semantic quality may mask contextual deficits.
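As an illustrative aside, the arithmetic behind the reported 52.5% character-count increase follows directly from the stated means (964 vs. 1470 characters), and ordinal Likert ratings of this kind are commonly compared with a nonparametric test. The sketch below assumes the Mann-Whitney U test and invented placeholder ratings purely for illustration; the abstract does not specify the study's actual statistical procedure or raw data.

```python
# Minimal sketch (placeholder data, not the study's data):
# 1) relative increase in mean answer length reported in the abstract,
# 2) one plausible nonparametric comparison of 6-point Likert ratings.
from scipy.stats import mannwhitneyu

# Mean character counts reported for ChatGPT 3 vs. ChatGPT 4
chars_gpt3, chars_gpt4 = 964, 1470
increase = (chars_gpt4 - chars_gpt3) / chars_gpt3
print(f"Relative increase in characters: {increase:.1%}")  # -> 52.5%

# Hypothetical medical-adequacy ratings on a 6-point Likert scale (assumed values)
consultant_ratings = [6, 5, 6, 5, 6, 6, 5, 6]
chatgpt_ratings = [4, 3, 5, 4, 3, 4, 4, 3]
stat, p = mannwhitneyu(consultant_ratings, chatgpt_ratings, alternative="two-sided")
print(f"Mann-Whitney U = {stat}, P = {p:.3f}")
```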
Pages: 9