Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis

Cited by: 2
Authors
Mavrych, Volodymyr [1 ]
Ganguly, Paul [1 ]
Bolgova, Olena [1 ]
Affiliations
[1] Alfaisal Univ, Coll Med, Riyadh, Saudi Arabia
Keywords
anatomy; artificial intelligence; Bard; ChatGPT; Copilot; extremities; Gemini; large language models; PaLM; performance; GPT-4
DOI
10.1002/ca.24244
Chinese Library Classification (CLC)
R602 [Surgical Pathology, Anatomy]; R32 [Human Morphology]
Subject classification code
100101
Abstract
The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including medical education, raises questions about their accuracy. The primary aim of our study was to undertake a detailed comparative analysis of the proficiencies and accuracies of seven different LLMs (ChatGPT-4, ChatGPT-3.5-turbo, ChatGPT-3.5, Copilot, PaLM, Bard, and Gemini) in responding to medical multiple-choice questions (MCQs), and in generating clinical scenarios and MCQs for upper limb topics in a Gross Anatomy course for medical students. The selected chatbots were tested on 50 USMLE-style MCQs, randomly drawn from the Gross Anatomy course exam database for medical students and reviewed by three independent experts. The results of five successive attempts by each chatbot to answer the question set were evaluated for accuracy, relevance, and comprehensiveness. ChatGPT-4 performed best, answering 60.5% ± 1.9% of questions accurately, followed by Copilot (42.0% ± 0.0%), ChatGPT-3.5 (41.0% ± 5.3%), and ChatGPT-3.5-turbo (38.5% ± 5.7%); Google PaLM 2 (34.5% ± 4.4%) and Bard (33.5% ± 3.0%) gave the poorest results. The overall performance of GPT-4 was statistically superior (p < 0.05) to those of Copilot, GPT-3.5, GPT-3.5-turbo, PaLM 2, and Bard by 18.6%, 19.5%, 22%, 26%, and 27%, respectively. Each chatbot was then asked to generate a clinical scenario for each of three randomly selected topics (the anatomical snuffbox, supracondylar fracture of the humerus, and the cubital fossa) together with three related anatomical MCQs with five options each, and to indicate the correct answers. Two independent experts analyzed and graded the 216 records received on a 0-5 scale. The best results were recorded for ChatGPT-4, followed by Gemini, ChatGPT-3.5, ChatGPT-3.5-turbo, and Google PaLM 2; Copilot received the lowest grade. Technological progress notwithstanding, LLMs have yet to mature sufficiently to take over the role of teacher or facilitator completely within a Gross Anatomy course; however, they can be valuable tools for medical educators.
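
The scoring described above reduces to simple arithmetic over repeated runs. The following Python sketch shows one way the reported mean ± SD accuracies and GPT-4's percentage-point margins could be computed; the per-attempt correct-answer counts are hypothetical placeholders, since only the published summary statistics are known.

    # Minimal sketch of the accuracy aggregation described in the abstract.
    # The per-attempt correct-answer counts below are ILLUSTRATIVE placeholders,
    # not the study's actual data (only the published means/SDs are known).
    from statistics import mean, stdev

    N_QUESTIONS = 50   # size of the USMLE-style MCQ set
    N_ATTEMPTS = 5     # each chatbot answered the full set five times

    # Hypothetical correct-answer counts per attempt for two models.
    attempts = {
        "ChatGPT-4": [30, 31, 30, 30, 30],   # ~60% accuracy
        "Copilot":   [21, 21, 21, 21, 21],   # ~42% accuracy, zero variance
    }

    for model, correct in attempts.items():
        acc = [100 * c / N_QUESTIONS for c in correct]   # per-attempt accuracy (%)
        print(f"{model}: {mean(acc):.1f}% +/- {stdev(acc):.1f}%")

    # Percentage-point gap between the best model and a competitor,
    # matching how the abstract reports GPT-4's superiority margins.
    gap = mean(attempts["ChatGPT-4"]) - mean(attempts["Copilot"])
    print(f"GPT-4 vs Copilot: {100 * gap / N_QUESTIONS:.1f} percentage points")
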
Pages: 200-210
Number of pages: 11