Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis

Cited by: 2
Authors
Mavrych, Volodymyr [1 ]
Ganguly, Paul [1 ]
Bolgova, Olena [1 ]
Affiliations
[1] Alfaisal Univ, Coll Med, Riyadh, Saudi Arabia
Keywords
anatomy; artificial intelligence; Bard; ChatGPT; Copilot; extremities; Gemini; large language models; PaLM; performance; GPT-4
DOI
10.1002/ca.24244
Chinese Library Classification (CLC)
R602 [Surgical Pathology, Anatomy]; R32 [Human Morphology]
Discipline code
100101
Abstract
The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including medical education, raises questions about their accuracy. The primary aim of our study was to undertake a detailed comparative analysis of the proficiencies and accuracies of seven different LLMs (ChatGPT-4, ChatGPT-3.5-turbo, ChatGPT-3.5, Copilot, PaLM, Bard, and Gemini) in responding to medical multiple-choice questions (MCQs), and in generating clinical scenarios and MCQs for upper limb topics in a Gross Anatomy course for medical students. The selected chatbots were tested on 50 USMLE-style MCQs, randomly selected from the Gross Anatomy course exam database for medical students and reviewed by three independent experts. The results of five successive attempts by each chatbot to answer the question set were evaluated in terms of accuracy, relevance, and comprehensiveness. The best result was achieved by ChatGPT-4, which answered 60.5% +/- 1.9% of the questions accurately, followed by Copilot (42.0% +/- 0.0%), ChatGPT-3.5 (41.0% +/- 5.3%), and ChatGPT-3.5-turbo (38.5% +/- 5.7%). Google PaLM 2 (34.5% +/- 4.4%) and Bard (33.5% +/- 3.0%) gave the poorest results. The overall performance of ChatGPT-4 was statistically superior (p < 0.05) to those of Copilot, ChatGPT-3.5, ChatGPT-3.5-turbo, PaLM 2, and Bard by 18.6%, 19.5%, 22%, 26%, and 27%, respectively. Each chatbot was then asked to generate a clinical scenario for each of three randomly selected topics (the anatomical snuffbox, supracondylar fracture of the humerus, and the cubital fossa) and three related anatomical MCQs with five options each, and to indicate the correct answers. Two independent experts analyzed and graded the 216 records received (on a 0-5 scale). The best results were recorded for ChatGPT-4, followed by Gemini, ChatGPT-3.5, and ChatGPT-3.5-turbo, then Google PaLM 2; Copilot received the lowest grade.
Technological progress notwithstanding, LLMs have yet to mature sufficiently to take over the role of teacher or facilitator completely within a Gross Anatomy course; however, they can be valuable tools for medical educators.
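The scoring scheme described in the abstract (each chatbot answers the same 50-question MCQ set in five successive attempts, and per-attempt accuracy is summarized as mean +/- standard deviation) can be sketched as follows. This is a minimal illustration, not the authors' code; the per-attempt correct-answer counts below are invented for demonstration, except that Copilot's reported 42.0% +/- 0.0% corresponds to 21/50 on every attempt.

```python
import statistics

# Hypothetical per-attempt counts of correct answers out of 50 MCQs.
# Five successive attempts per chatbot, as described in the abstract.
attempts = {
    "ChatGPT-4": [30, 31, 30, 29, 31],   # invented example data
    "Copilot":   [21, 21, 21, 21, 21],   # consistent with 42.0% +/- 0.0%
}

def summarize(correct_counts, n_questions=50):
    """Return (mean accuracy %, sample standard deviation %) across attempts."""
    pct = [100 * c / n_questions for c in correct_counts]
    return statistics.mean(pct), statistics.stdev(pct)

for model, counts in attempts.items():
    mean, sd = summarize(counts)
    print(f"{model}: {mean:.1f}% +/- {sd:.1f}%")
```

A significance test between models (the abstract reports p < 0.05 for ChatGPT-4 versus the others) would then be run on these per-attempt accuracy samples.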
Pages: 200-210
Page count: 11
Related papers
50 records
  • [31] Evaluating ChatGPT, Gemini and other Large Language Models (LLMs) in orthopaedic diagnostics: A prospective clinical study
    Pagano, Stefano
    Strumolo, Luigi
    Michalk, Katrin
    Schiegl, Julia
    Pulido, Loreto C.
    Reinhard, Jan
    Maderbacher, Guenther
    Renkawitz, Tobias
    Schuster, Marie
    COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, 2025, 28 : 9 - 15
  • [32] Large language models as assistance for glaucoma surgical cases: a ChatGPT vs. Google Gemini comparison
    Carla, Matteo Mario
    Gambini, Gloria
    Baldascino, Antonio
    Boselli, Francesco
    Giannuzzi, Federico
    Margollicci, Fabio
    Rizzo, Stanislao
    GRAEFES ARCHIVE FOR CLINICAL AND EXPERIMENTAL OPHTHALMOLOGY, 2024, 262 (09) : 2945 - 2959
  • [33] Large language models (LLMs) in the evaluation of emergency radiology reports: performance of ChatGPT-4, Perplexity, and Bard
    Infante, A.
    Gaudino, S.
    Orsini, F.
    Del Ciello, A.
    Gulli, C.
    Merlino, B.
    Natale, L.
    Iezzi, R.
    Sala, E.
    CLINICAL RADIOLOGY, 2024, 79 (02) : 102 - 106
  • [34] Preliminary fatty liver disease grading using general-purpose online large language models: ChatGPT-4 or Bard?
    Zhang, Yiwen
    Liu, Hanyun
    Sheng, Bin
    Tham, Yih Chung
    Ji, Hongwei
    JOURNAL OF HEPATOLOGY, 2024, 80 (06) : e279 - e281
  • [35] Comparing the Efficacy of Large Language Models ChatGPT, BARD, and Bing AI in Providing Information on Rhinoplasty: An Observational Study
    Seth, Ishith
    Lim, Bryan
    Xie, Yi
    Cevik, Jevan
    Rozen, Warren M.
    Ross, Richard J.
    Lee, Mathew
    AESTHETIC SURGERY JOURNAL OPEN FORUM, 2023, 5
  • [36] Evaluating the Performance of Large Language Models in Anatomy Education: Advancing Anatomy Learning with ChatGPT-4o
    Ok, Fatma
    Karip, Burak
    Korkmaz, Fulya Temizsoy
    EUROPEAN JOURNAL OF THERAPEUTICS, 2025, 31 (01) : 35 - 43
  • [37] Incorporation of ChatGPT and Other Large Language Models into a Graduate Level Computational Bioengineering Course
    King, Michael R.
    Abdulrahman, Adam M.
    Petrovic, Mark I.
    Poley, Patricia L.
    Hall, Sarah P.
    Kulapatana, Surat
    Lamantia, Zachary E.
    CELLULAR AND MOLECULAR BIOENGINEERING, 2024, 17 (01) : 1 - 6
  • [39] Comparing the Spatial Querying Capacity of Large Language Models: OpenAI's ChatGPT and Google's Gemini Pro
    Renshaw, Andrea
    Lourentzou, Ismini
    Lee, Jinhyung
    Crawford, Thomas
    Kim, Junghwan
    PROFESSIONAL GEOGRAPHER, 2025, 77 (02) : 186 - 198
  • [40] Large Language Models for Intraoperative Decision Support in Plastic Surgery: A Comparison between ChatGPT-4 and Gemini
    Gomez-Cabello, Cesar A.
    Borna, Sahar
    Pressman, Sophia M.
    Haider, Syed Ali
    Forte, Antonio J.
    MEDICINA-LITHUANIA, 2024, 60 (06)