Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability

Cited by: 0
Authors
Pack A. [1]
Barrett A. [2]
Escalante J. [1]
Affiliations
[1] Faculty of Education and Social Work, Brigham Young University-Hawaii, 55-220 Kulanui Street Bldg 5, Laie, HI 96762-1293
[2] College of Education, Florida State University, Stone Building, 114 West Call Street, Tallahassee, FL 32306-2400
Keywords
Artificial intelligence; Automatic essay scoring; Automatic writing evaluation; ChatGPT; Generative AI; Large language model
DOI
10.1016/j.caeai.2024.100234
Abstract
Advancements in generative AI, such as large language models (LLMs), may offer a potential solution to the burdensome task of essay grading that language education teachers often face. Yet the validity and reliability of leveraging LLMs for automatic essay scoring (AES) in language education are not well understood. To address this, we evaluated the cross-sectional and longitudinal validity and reliability of four prominent LLMs (Google's PaLM 2, Anthropic's Claude 2, and OpenAI's GPT-3.5 and GPT-4) for the AES of English language learners' writing. A total of 119 essays from an English language placement test were assessed by each LLM on two separate occasions, as well as by a pair of human raters. GPT-4 performed the best, demonstrating excellent intrarater reliability and good validity. All models, with the exception of GPT-3.5, improved over time in their intrarater reliability. The interrater reliability of GPT-3.5 and GPT-4, however, decreased slightly over time. These findings indicate that some models perform better than others in AES and that all models are subject to fluctuations in their performance. We discuss potential reasons for such variability and offer suggestions for prospective avenues of research. © 2024 The Authors
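A note on the statistics involved: the abstract contrasts intrarater reliability (a model's agreement with its own earlier scores for the same essays) with interrater reliability (agreement between the model and human raters). This record does not report which statistic the study used, so the Python sketch below uses quadratic weighted kappa, a common choice for ordinal essay scores, purely for illustration; the score arrays are fabricated placeholders, not data from the study.

    # Illustrative only: quadratic weighted kappa (QWK) as one plausible
    # reliability statistic for ordinal essay scores; all data fabricated.
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical scores on a 1-6 scale for the same eight essays.
    llm_run_1 = [4, 3, 5, 2, 4, 3, 5, 4]  # LLM, first scoring occasion
    llm_run_2 = [4, 3, 4, 2, 4, 3, 5, 4]  # LLM, second scoring occasion
    human     = [4, 2, 5, 2, 3, 3, 5, 4]  # human rater consensus scores

    # Intrarater reliability: the model's agreement with itself over time.
    intra = cohen_kappa_score(llm_run_1, llm_run_2, weights="quadratic")

    # Interrater reliability: the model's agreement with the human raters.
    inter = cohen_kappa_score(llm_run_1, human, weights="quadratic")

    print(f"Intrarater QWK: {intra:.3f}")
    print(f"Interrater QWK: {inter:.3f}")

A drop in either statistic between scoring occasions would indicate the kind of performance fluctuation the abstract reports.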
Related Papers
50 items in total
  • [21] English language learners and automated scoring of essays: Critical considerations
    Weigle, Sara Cushing
    ASSESSING WRITING, 2013, 18 (01) : 85 - 99
  • [22] Automated Scoring of Constructed Response Items in Math Assessment Using Large Language Models
    Morris, Wesley
    Holmes, Langdon
    Choi, Joon Suh
    Crossley, Scott
    INTERNATIONAL JOURNAL OF ARTIFICIAL INTELLIGENCE IN EDUCATION, 2024,
  • [23] Nexus of essay writing and computer-assisted language learning (CALL) in English language classroom
    Tariq, Umbreen
    INTERACTIVE TECHNOLOGY AND SMART EDUCATION, 2025, 22 (01) : 103 - 133
  • [24] An Approach for Automated Evaluation of Essay-Writing in Second Language Learning
    Kishi, Yasuhito
    INTED2016: 10TH INTERNATIONAL TECHNOLOGY, EDUCATION AND DEVELOPMENT CONFERENCE, 2016, : 8193 - 8200
  • [25] Large language models and the future of academic writing
    Nayak, P.
    Gogtay, N. J.
    JOURNAL OF POSTGRADUATE MEDICINE, 2024, 70 (02) : 67 - 68
  • [26] Wordcraft: Story Writing With Large Language Models
    Yuan, Ann
    Coenen, Andy
    Reif, Emily
    Ippolito, Daphne
    IUI'22: 27TH INTERNATIONAL CONFERENCE ON INTELLIGENT USER INTERFACES, 2022, : 841 - 852
  • [27] PEER: Empowering Writing with Large Language Models
    Sessler, Kathrin
    Xiang, Tao
    Bogenrieder, Lukas
    Kasneci, Enkelejda
    RESPONSIVE AND SUSTAINABLE EDUCATIONAL FUTURES, EC-TEL 2023, 2023, 14200 : 755 - 761
  • [28] Combination of multiple regression and text categorization in Automated Essay Scoring of college English writing
    Ge, Shili
    INFORMATION TECHNOLOGY JOURNAL, 2013, 12 (24) : 7977 - 7982
  • [29] Implications of dispositions for foreign language writing: The case of the Arabic-English learner
    Pilotti, Maura A. E.
    Al-Mulhem, Huda
    El Alaoui, Khadija
    Waked, Arifi N.
    LANGUAGE TEACHING RESEARCH, 2024,
  • [30] Large Language Models: A Socio-Philosophical Essay
    Penner, Regina V.
    GALACTICA MEDIA-JOURNAL OF MEDIA STUDIES - GALAKTIKA MEDIA-ZHURNAL MEDIA ISSLEDOVANIJ, 2024, 6 (03) : 83 - 100