ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case-Based Questions

Cited: 14

Authors
Buhr, Christoph Raphael [1 ,2 ,6 ]
Smith, Harry [3 ]
Huppertz, Tilman [1 ]
Bahr-Hamm, Katharina [1 ]
Matthias, Christoph [1 ]
Blaikie, Andrew [2 ]
Kelsey, Tom [3 ]
Kuhn, Sebastian [4 ,5 ]
Eckrich, Jonas [1 ]
Affiliations
[1] Johannes Gutenberg Univ Mainz, Univ Med Ctr, Dept Otorhinolaryngol, Mainz, Germany
[2] Univ St Andrews, Sch Med, St Andrews, Scotland
[3] Univ St Andrews, Sch Comp Sci, St Andrews, Scotland
[4] Philipps Univ Marburg, Inst Digital Med, Marburg, Germany
[5] Univ Hosp Giessen & Marburg, Marburg, Germany
[6] Johannes Gutenberg Univ Mainz, Univ Med Ctr, Dept Otorhinolaryngol, Langenbeckstr 1, D-55131 Mainz, Germany
Source
JMIR MEDICAL EDUCATION | 2023年 / 9卷
Keywords
large language models; LLMs; LLM; artificial intelligence; AI; ChatGPT; otorhinolaryngology; ORL; digital health; chatbots; global health; low-and middle-income countries; telemedicine; telehealth; language model; chatbot; ONLINE HEALTH INFORMATION; NONVERBAL-COMMUNICATION; SEEKING; ANXIETY; GOOGLE;
DOI
10.2196/49183
Chinese Library Classification: G40 [Education]
Subject classification codes: 040101; 120403
Abstract
Background: Large language models (LLMs), such as ChatGPT (OpenAI), are increasingly used in medicine and supplement standard search engines as information sources. This leads to more "consultations" of LLMs about personal medical symptoms.

Objective: This study aims to evaluate ChatGPT's performance in answering clinical case-based questions in otorhinolaryngology (ORL) in comparison to ORL consultants' answers.

Methods: We used 41 case-based questions from established ORL study books and past German state examinations for doctors. The questions were answered by both ORL consultants and ChatGPT 3. ORL consultants rated all responses, except their own, on medical adequacy, conciseness, coherence, and comprehensibility using a 6-point Likert scale. They also identified, in a blinded setting, whether each answer was created by an ORL consultant or by ChatGPT. Additionally, the character count was compared. Given the rapidly evolving pace of the technology, a comparison between responses generated by ChatGPT 3 and ChatGPT 4 was also performed.

Results: Ratings in all categories were significantly higher for ORL consultants (P<.001). Although inferior to the ORL consultants' scores, ChatGPT scored relatively higher in the semantic categories (conciseness, coherence, and comprehensibility) than in medical adequacy. ORL consultants correctly identified ChatGPT as the source in 98.4% (121/123) of cases. ChatGPT's answers had a significantly higher character count than those of the ORL consultants (P<.001). The comparison between responses generated by ChatGPT 3 and ChatGPT 4 showed a slight improvement in medical accuracy as well as better coherence of the answers. Conversely, neither conciseness (P=.06) nor comprehensibility (P=.08) improved significantly, despite a significant 52.5% increase in the mean character count ((1470-964)/964; P<.001).

Conclusions: While ChatGPT provided longer answers to medical problems, medical adequacy and conciseness were significantly lower than in the ORL consultants' answers. LLMs have potential as augmentative tools for medical care, but their "consultation" for medical problems carries a high risk of misinformation, as their high semantic quality may mask contextual deficits.
Pages: 9