Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis

Cited by: 2
|
Authors
Mavrych, Volodymyr [1 ]
Ganguly, Paul [1 ]
Bolgova, Olena [1 ]
Affiliations
[1] Alfaisal Univ, Coll Med, Riyadh, Saudi Arabia
Keywords
anatomy; artificial intelligence; Bard; ChatGPT; Copilot; extremities; Gemini; large language models; PaLM; performance; GPT-4;
DOI
10.1002/ca.24244
Chinese Library Classification (CLC) number
R602 [Surgical Pathology, Anatomy]; R32 [Human Morphology];
Discipline classification code
100101 ;
Abstract
The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including medical education, raises questions about their accuracy. The primary aim of our study was to undertake a detailed comparative analysis of the proficiencies and accuracies of seven different LLMs (ChatGPT-4, ChatGPT-3.5-turbo, ChatGPT-3.5, Copilot, PaLM, Bard, and Gemini) in responding to medical multiple-choice questions (MCQs), and in generating clinical scenarios and MCQs for upper limb topics in a Gross Anatomy course for medical students. The selected chatbots were tested on 50 USMLE-style MCQs, randomly drawn from the Gross Anatomy course exam database for medical students and reviewed by three independent experts. The results of five successive attempts by each chatbot to answer the question set were evaluated in terms of accuracy, relevance, and comprehensiveness. The best result was achieved by ChatGPT-4, which answered 60.5% ± 1.9% of the questions accurately, followed by Copilot (42.0% ± 0.0%), ChatGPT-3.5 (41.0% ± 5.3%), and ChatGPT-3.5-turbo (38.5% ± 5.7%); Google PaLM 2 (34.5% ± 4.4%) and Bard (33.5% ± 3.0%) gave the poorest results. The overall performance of GPT-4 was statistically superior (p < 0.05) to those of Copilot, GPT-3.5, GPT-3.5-turbo, PaLM 2, and Bard by 18.6%, 19.5%, 22%, 26%, and 27%, respectively. Each chatbot was then asked to generate a clinical scenario for each of three randomly selected topics (the anatomical snuffbox, supracondylar fracture of the humerus, and the cubital fossa), together with three related anatomical MCQs with five options each, and to indicate the correct answers. Two independent experts analyzed and graded the 216 records received (0-5 scale). The best results were recorded for ChatGPT-4, then Gemini, ChatGPT-3.5, ChatGPT-3.5-turbo, and Google PaLM 2; Copilot received the lowest grade.
Technological progress notwithstanding, LLMs have yet to mature sufficiently to take over the role of teacher or facilitator completely within a Gross Anatomy course; however, they can be valuable tools for medical educators.
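The abstract reports each chatbot's score as a mean ± SD accuracy over five successive attempts at the 50-question set. A minimal sketch of that aggregation, using hypothetical per-attempt correct counts (the study's raw attempt-level counts are not given in the abstract):

```python
# Illustrative sketch only: computing mean ± sample SD accuracy over five
# attempts at a 50-question MCQ set, as in the study's reporting format.
# The per-attempt correct counts below are hypothetical, not the paper's data.
from statistics import mean, stdev

N_QUESTIONS = 50  # size of the USMLE-style MCQ set

def accuracy_summary(correct_counts):
    """Return (mean %, sample SD %) of per-attempt accuracy percentages."""
    pcts = [100.0 * c / N_QUESTIONS for c in correct_counts]
    return round(mean(pcts), 1), round(stdev(pcts), 1)

# Hypothetical correct counts for one chatbot across five attempts:
m, s = accuracy_summary([30, 31, 29, 32, 30])
print(f"{m}% +/- {s}%")  # -> 60.8% +/- 2.3%
```

The sample standard deviation (`stdev`, n-1 denominator) is the conventional choice for a small number of repeated attempts; whether the study used sample or population SD is not stated in the abstract.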
Pages: 200-210
Page count: 11
Related papers
50 records in total
  • [21] The Potential Role of Large Language Models in Uveitis Care: Perspectives After ChatGPT and Bard Launch
    Ming, Collin Tan Yip
    Rojas-Carabali, William
    Cifuentes-Gonzalez, Carlos
    Agrawal, Rajdeep
    Thorne, Jennifer E.
    Tugal-Tutkun, Ilknur
    Nguyen, Quan Dong
    Gupta, Vishali
    de-la-Torre, Alejandra
    Agrawal, Rupesh
    OCULAR IMMUNOLOGY AND INFLAMMATION, 2024, 32 (07) : 1435 - 1439
  • [22] Evolving Landscape of Large Language Models: An Evaluation of ChatGPT and Bard in Answering Patient Queries on Colonoscopy
    Tariq, Raseen
    Malik, Sheza
    Khanna, Sahil
    GASTROENTEROLOGY, 2024, 166 (01) : 220 - 221
  • [23] Reply to 'Comment on: Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3'
    Zhao, Fang-Fang
    He, Han-Jie
    Liang, Jia-Jian
    Cen, Ling-Ping
    EYE, 2025: 1433 - 1433
  • [24] Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology
    Dhanvijay, Anup Kumar D.
    Pinjar, Mohammed Jaffer
    Dhokane, Nitin
    Sorte, Smita R.
    Kumari, Amita
    Mondal, Himel
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (08)
  • [25] The Quality of AI-Generated Dental Caries Multiple Choice Questions: A Comparative Analysis of ChatGPT and Google Bard Language Models
    Ahmed, Walaa Magdy
    Azhari, Amr Ahmed
    Alfaraj, Amal
    Alhamadani, Abdulaziz
    Zhang, Min
    Lu, Chang-Tien
    HELIYON, 2024, 10 (07)
  • [26] Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing
    Makrygiannakis, Miltiadis A.
    Giannakopoulos, Kostis
    Kaklamanos, Eleftherios G.
    EUROPEAN JOURNAL OF ORTHODONTICS, 2024
  • [27] The Security of Using Large Language Models: A Survey with Emphasis on ChatGPT
    Zhou, Wei
    Zhu, Xiaogang
    Han, Qing-Long
    Li, Lin
    Chen, Xiao
    Wen, Sheng
    Xiang, Yang
    IEEE-CAA JOURNAL OF AUTOMATICA SINICA, 2025, 12 (01) : 1 - 26
  • [29] Assisting Static Analysis with Large Language Models: A ChatGPT Experiment
    Li, Haonan
    Hao, Yu
    Zhai, Yizhuo
    Qian, Zhiyun
    PROCEEDINGS OF THE 31ST ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2023, 2023, : 2107 - 2111
  • [30] Safety analysis in the era of large language models: A case study of STPA using ChatGPT
    Qi, Yi
    Zhao, Xingyu
    Khastgir, Siddartha
    Huang, Xiaowei
    MACHINE LEARNING WITH APPLICATIONS, 2025, 19