Comparing Scoring Consistency of Large Language Models with Faculty for Formative Assessments in Medical Education

Citations: 0
Authors
Sreedhar, Radhika [1 ]
Chang, Linda [1 ]
Gangopadhyaya, Ananya [1 ]
Shiels, Peggy Woziwodzki [1 ]
Loza, Julie [1 ]
Chi, Euna [1 ]
Gabel, Elizabeth [1 ]
Park, Yoon Soo [1 ]
Affiliations
[1] Univ Illinois, Coll Med, Chicago, IL 60607 USA
Keywords
ChatGPT; scoring consistency; formative assignments; STUDENTS
DOI
10.1007/s11606-024-09050-9
Chinese Library Classification
R19 [Health organization and services (health service management)]
Abstract
Background: The Liaison Committee on Medical Education requires that medical students receive individualized feedback on their self-directed learning skills. Pre-clinical students are asked to complete multiple spaced critical appraisal assignments. However, individualized feedback requires significant faculty time. As large language models (LLMs) can score assignments and generate feedback, we explored their use in grading formative assessments through validity and feasibility lenses.
Objective: To explore the consistency and feasibility of using an LLM to assess and provide feedback for formative assessments in undergraduate medical education.
Design and Participants: This was a cross-sectional study of pre-clinical students' critical appraisal assignments at the University of Illinois College of Medicine (UICOM) during the 2022-2023 academic year.
Intervention: An initial sample of ten assignments was used to develop a prompt. For each student entry, the de-identified assignment and prompt were provided to ChatGPT 3.5, and its scoring was compared to the existing faculty grade.
Main Measures: Differences in scoring of individual items between ChatGPT and faculty were assessed. Scoring consistency, measured as inter-rater reliability (IRR), was calculated as percent exact agreement. A chi-squared test was used to determine whether there were significant differences in scores. Psychometric characteristics including internal-consistency reliability, area under the precision-recall curve (AUCPR), and cost were studied.
Key Results: In this cross-sectional study, 111 pre-clinical students' faculty-graded assignments were compared with those of ChatGPT, and the scoring of individual items was comparable. The overall agreement between ChatGPT and faculty was 67% (OR = 2.53, P < 0.001); mean AUCPR was 0.69 (range 0.61-0.76). Internal-consistency reliability of ChatGPT was 0.64, and its use resulted in a fivefold reduction in faculty time, with potential savings of 150 faculty hours.
Conclusions: This study of psychometric characteristics of ChatGPT demonstrates the potential role for LLMs to assist faculty in assessing and providing feedback for formative assignments.
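The percent-exact-agreement statistic described in Main Measures can be sketched as follows. The item scores below are hypothetical illustrations, not data from the study:

```python
# Minimal sketch (hypothetical data): percent exact agreement between
# faculty and LLM scores on the same set of assignment items.
faculty = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
llm     = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

# Proportion of items where both raters assigned the identical score.
agreement = sum(f == g for f, g in zip(faculty, llm)) / len(faculty)
print(f"Exact agreement: {agreement:.0%}")  # 80%
```

Percent exact agreement is the simplest IRR index; it does not correct for chance agreement, which is why the study also reports additional psychometric characteristics such as AUCPR and internal-consistency reliability.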
Pages: 127-134
Page count: 8
Related Papers
50 records
  • [21] Comparing the dental knowledge of large language models
    Tussie, Camila
    Starosta, Abraham
    BRITISH DENTAL JOURNAL, 2024,
  • [22] AI-Tutoring in Software Engineering Education: Experiences with Large Language Models in Programming Assessments
    Frankford, Eduard
    Sauerwein, Clemens
    Bassner, Patrick
    Krusche, Stephan
    Breu, Ruth
    2024 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING: SOFTWARE ENGINEERING EDUCATION AND TRAINING, ICSE-SEET 2024, 2024, : 309 - 319
  • [23] Assessing the Utilization of Large Language Models in Medical Education: Insights From Undergraduate Medical Students
    Biri, Sairavi Kiran
    Kumar, Subir
    Panigrahi, Muralidhar
    Mondal, Shaikat
    Behera, Joshil Kumar
    Mondal, Himel
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (10)
  • [24] Generative Artificial Intelligence and Large Language Models in Primary Care Medical Education
    Parente, Daniel J.
    FAMILY MEDICINE, 2024, 56 (09) : 534 - 540
  • [25] Ethical Considerations and Fundamental Principles of Large Language Models in Medical Education: Viewpoint
    Li, Zhui
    Li, Fenghe
    Wang, Xuehu
    Fu, Qining
    Ren, Wei
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [26] ChatGPT and Other Large Language Models in Medical Education - Scoping Literature Review
    Aster, Alexandra
    Laupichler, Matthias Carl
    Rockwell-Kollmann, Tamina
    Masala, Gilda
    Bala, Ebru
    Raupach, Tobias
    MEDICAL SCIENCE EDUCATOR, 2024, : 555 - 567
  • [27] Integrating Large Language Models in Bioinformatics Education for Medical Students: Opportunities and Challenges
    Kang, Kai
    Yang, Yuqi
    Wu, Yijun
    Luo, Ren
    ANNALS OF BIOMEDICAL ENGINEERING, 2024, 52 (09) : 2311 - 2315
  • [28] Benchmarking medical large language models
    Bakhshandeh, Sadra
    NATURE REVIEWS BIOENGINEERING, 2023, 1 (08): : 543 - 543
  • [29] The Use of Large Language Models in Education
    Xing, Wanli
    Nixon, Nia
    Crossley, Scott
    Denny, Paul
    Lan, Andrew
    Stamper, John
    Yu, Zhou
    INTERNATIONAL JOURNAL OF ARTIFICIAL INTELLIGENCE IN EDUCATION, 2025,
  • [30] Integration of online formative assessments into medical education: Experience from University of Zagreb Medical School, Croatia
    Taradi, SK
    Taradi, M
    Radic, KI
    NATIONAL MEDICAL JOURNAL OF INDIA, 2005, 18 (01): : 39 - 40