Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis

Cited by: 2
Authors
Mavrych, Volodymyr [1 ]
Ganguly, Paul [1 ]
Bolgova, Olena [1 ]
Affiliations
[1] Alfaisal Univ, Coll Med, Riyadh, Saudi Arabia
Keywords
anatomy; artificial intelligence; Bard; ChatGPT; Copilot; extremities; Gemini; large language models; PaLM; PERFORMANCE; GPT-4
DOI
10.1002/ca.24244
CLC numbers
R602 [Surgical Pathology and Anatomy]; R32 [Human Morphology]
Discipline code
100101
Abstract
The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including medical education, raises questions about their accuracy. The primary aim of our study was a detailed comparative analysis of the proficiency and accuracy of seven LLMs (ChatGPT-4, ChatGPT-3.5-turbo, ChatGPT-3.5, Copilot, PaLM, Bard, and Gemini) in answering medical multiple-choice questions (MCQs) and in generating clinical scenarios and MCQs for upper limb topics in a Gross Anatomy course for medical students. The selected chatbots were tested on 50 USMLE-style MCQs, randomly drawn from the Gross Anatomy course exam database for medical students and reviewed by three independent experts. The results of five successive attempts by each chatbot to answer the question set were evaluated for accuracy, relevance, and comprehensiveness. ChatGPT-4 performed best, answering 60.5% ± 1.9% of the questions accurately, followed by Copilot (42.0% ± 0.0%), ChatGPT-3.5 (41.0% ± 5.3%), and ChatGPT-3.5-turbo (38.5% ± 5.7%). Google PaLM 2 (34.5% ± 4.4%) and Bard (33.5% ± 3.0%) gave the poorest results. The overall performance of GPT-4 was statistically superior (p < 0.05) to that of Copilot, GPT-3.5, GPT-3.5-turbo, PaLM 2, and Bard by 18.6%, 19.5%, 22%, 26%, and 27%, respectively. Each chatbot was then asked to generate a clinical scenario for each of three randomly selected topics (the anatomical snuffbox, supracondylar fracture of the humerus, and the cubital fossa), together with three related anatomical MCQs with five options each, and to indicate the correct answers. Two independent experts analyzed and graded the 216 records received on a 0-5 scale. ChatGPT-4 again scored highest, followed by Gemini, ChatGPT-3.5, ChatGPT-3.5-turbo, and Copilot; Google PaLM 2 received the lowest grade. Technological progress notwithstanding, LLMs have yet to mature sufficiently to take over the role of teacher or facilitator completely within a Gross Anatomy course; however, they can be valuable tools for medical educators.
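For illustration, the accuracy figures in the abstract reduce to simple summary statistics over repeated attempts. Below is a minimal Python sketch, not the authors' actual pipeline: the per-attempt correct-answer counts are hypothetical, and Welch's t-test is one plausible choice for the pairwise comparison, since the abstract does not name the exact statistical test used.

# Minimal sketch (assumptions noted above): score five repeated attempts
# per chatbot on a 50-question MCQ set, report mean ± SD accuracy, and run
# one pairwise significance test. All counts below are illustrative.
from statistics import mean, stdev
from scipy import stats  # Welch's t-test for the pairwise comparison

N_QUESTIONS = 50

# Hypothetical numbers of correctly answered questions per attempt
# (five attempts per model, matching the study design).
correct_counts = {
    "ChatGPT-4": [31, 30, 30, 31, 29],
    "Copilot":   [21, 21, 21, 21, 21],
}

def accuracy_summary(counts):
    # Convert raw counts to percentage accuracies, then summarize.
    accs = [100 * c / N_QUESTIONS for c in counts]
    return mean(accs), stdev(accs)

for model, counts in correct_counts.items():
    m, sd = accuracy_summary(counts)
    print(f"{model}: {m:.1f}% ± {sd:.1f}%")

# Pairwise comparison of two models' per-attempt accuracies.
a = [100 * c / N_QUESTIONS for c in correct_counts["ChatGPT-4"]]
b = [100 * c / N_QUESTIONS for c in correct_counts["Copilot"]]
t, p = stats.ttest_ind(a, b, equal_var=False)
print(f"Welch t-test: t={t:.2f}, p={p:.4f}")

With the illustrative counts above, this prints roughly 60.4% ± 1.7% for ChatGPT-4 and 42.0% ± 0.0% for Copilot, in the spirit of the values reported in the abstract.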
Pages: 200-210
Number of pages: 11