Comparing Scoring Consistency of Large Language Models with Faculty for Formative Assessments in Medical Education

Cited: 0
Authors
Sreedhar, Radhika [1 ]
Chang, Linda [1 ]
Gangopadhyaya, Ananya [1 ]
Shiels, Peggy Woziwodzki [1 ]
Loza, Julie [1 ]
Chi, Euna [1 ]
Gabel, Elizabeth [1 ]
Park, Yoon Soo [1 ]
Affiliations
[1] Univ Illinois, Coll Med, Chicago, IL 60607 USA
Keywords
ChatGPT; scoring consistency; formative assignments; STUDENTS
DOI
10.1007/s11606-024-09050-9
CLC Classification
R19 [Health organization and services (health administration)]
Abstract
Background: The Liaison Committee on Medical Education requires that medical students receive individualized feedback on their self-directed learning skills. Pre-clinical students are asked to complete multiple spaced critical appraisal assignments; however, individualized feedback on these requires significant faculty time. As large language models (LLMs) can score work and generate feedback, we explored their use in grading formative assessments through validity and feasibility lenses.
Objective: To explore the consistency and feasibility of using an LLM to assess and provide feedback for formative assessments in undergraduate medical education.
Design and Participants: This was a cross-sectional study of pre-clinical students' critical appraisal assignments at the University of Illinois College of Medicine (UICOM) during the 2022-2023 academic year.
Intervention: An initial sample of ten assignments was used to develop a prompt. For each student entry, the de-identified assignment and prompt were provided to ChatGPT 3.5, and its scoring was compared to the existing faculty grade.
Main Measures: Differences between ChatGPT and faculty in the scoring of individual items were assessed. Scoring consistency, as inter-rater reliability (IRR), was calculated as percent exact agreement. A chi-squared test was used to determine whether scores differed significantly. Psychometric characteristics, including internal-consistency reliability, area under the precision-recall curve (AUCPR), and cost, were also studied.
Key Results: In this cross-sectional study, 111 pre-clinical students' faculty-graded assignments were compared with those of ChatGPT, and the scoring of individual items was comparable. The overall agreement between ChatGPT and faculty was 67% (OR = 2.53, P < 0.001); mean AUCPR was 0.69 (range 0.61-0.76). Internal-consistency reliability of ChatGPT was 0.64, and its use resulted in a fivefold reduction in faculty time, with potential savings of 150 faculty hours.
Conclusions: This study of the psychometric characteristics of ChatGPT demonstrates a potential role for LLMs in assisting faculty with assessing and providing feedback on formative assignments.
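For readers who want to see how the agreement metrics named in the abstract are typically computed, a minimal Python sketch follows. This is not the authors' code: the binary item scores are simulated stand-ins for the 111 faculty/ChatGPT ratings, and scikit-learn's average precision is used as a standard approximation of AUCPR.

import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
faculty = rng.integers(0, 2, size=111)  # hypothetical binary faculty item scores
# flip roughly a third of the faculty scores to mimic the reported ~67% agreement
llm = np.where(rng.random(111) < 0.67, faculty, 1 - faculty)

# inter-rater reliability as percent exact agreement
exact_agreement = (faculty == llm).mean()

# chi-squared test on the 2x2 contingency table of faculty vs. LLM scores
table = np.array([[((faculty == i) & (llm == j)).sum() for j in (0, 1)]
                  for i in (0, 1)])
chi2, p, dof, expected = chi2_contingency(table)

# AUCPR, treating the faculty scores as ground truth
aucpr = average_precision_score(faculty, llm)

print(f"exact agreement = {exact_agreement:.2f}, chi2 p = {p:.3g}, AUCPR = {aucpr:.2f}")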
Pages: 127-134
Page count: 8
Related Articles (10 of 50 shown)
  • [1] Registrar feedback on 'Formative assessments in medical education'
    Osborne, Pete
    Bal, Bahia
    BRITISH JOURNAL OF GENERAL PRACTICE, 2013, 63 (612): 347-348
  • [2] Large Language Models and Their Implications on Medical Education
    Bair, Henry
    Norden, Justin
    ACADEMIC MEDICINE, 2023, 98 (08): 869-870
  • [3] Formative assessments in medical education: a medical graduate's perspective
    Abu-Zaid, Ahmed
    PERSPECTIVES ON MEDICAL EDUCATION, 2013, 2 (5-6): 358-359
  • [4] Providing Automated Feedback on Formative Science Assessments: Uses of Multimodal Large Language Models
    Nguyen, Ha
    Park, Saerok
    FIFTEENTH INTERNATIONAL CONFERENCE ON LEARNING ANALYTICS & KNOWLEDGE, LAK 2025, 2025: 803-809
  • [5] Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
    Seo, Hyein
    Hwang, Taewook
    Jung, Jeesu
    Kang, Hyeonseok
    Namgoong, Hyuk
    Lee, Yohan
    Jung, Sangkeun
    APPLIED SCIENCES-BASEL, 2025, 15 (02)
  • [6] Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions
    Laupichler, Matthias Carl
    Rother, Johanna Flora
    Kadow, Ilona C. Grunwald
    Ahmadi, Seifollah
    Raupach, Tobias
    ACADEMIC MEDICINE, 2024, 99 (05): 508-512
  • [7] Large language models (ChatGPT) in medical education: Embrace or abjure?
    Luke, Nathasha
    Taneja, Reshma
    Ban, Kenneth
    Samarasekera, Dujeepa
    Yap, Celestial T.
    ASIA PACIFIC SCHOLAR, 2023, 8 (04): 50-52
  • [8] A systematic review of large language models and their implications in medical education
    Lucas, Harrison C.
    Upperman, Jeffrey S.
    Robinson, Jamie R.
    MEDICAL EDUCATION, 2024, 58 (11): 1276-1285
  • [9] Impact of Large Language Models on Medical Education and Teaching Adaptations
    Li, Zhui
    Yhap, Nina
    Liu, Liping
    Wang, Zhengjie
    Xiong, Zhonghao
    Yuan, Xiaoshu
    Cui, Hong
    Liu, Xuexiu
    Ren, Wei
    JMIR MEDICAL INFORMATICS, 2024, 12
  • [10] The Role of Large Language Models in Medical Education: Applications and Implications
    Safranek, Conrad W.
    Sidamon-Eristoff, Anne Elizabeth
    Gilson, Aidan
    Chartash, David
    JMIR MEDICAL EDUCATION, 2023, 9