Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability

Authors
Pack A. [1]
Barrett A. [2]
Escalante J. [1]
Affiliations
[1] Faculty of Education and Social Work, Brigham Young University-Hawaii, 55-220 Kulanui Street Bldg 5, Laie, 96762-1293, HI
[2] College of Education, Florida State University, Stone Building, 114 West Call Street, Tallahassee, 32306-2400, FL
Keywords
Artificial intelligence; Automatic essay scoring; Automatic writing evaluation; ChatGPT; Generative AI; Large language model
DOI
10.1016/j.caeai.2024.100234
Abstract
Advancements in generative AI, such as large language models (LLMs), may offer a solution to the burdensome task of essay grading often faced by language education teachers. Yet the validity and reliability of LLMs for automatic essay scoring (AES) in language education are not well understood. To address this, we evaluated the cross-sectional and longitudinal validity and reliability of four prominent LLMs (Google's PaLM 2, Anthropic's Claude 2, and OpenAI's GPT-3.5 and GPT-4) for the AES of English language learners' writing. A total of 119 essays taken from an English language placement test were assessed twice by each LLM, on two separate occasions, as well as by a pair of human raters. GPT-4 performed the best, demonstrating excellent intrarater reliability and good validity. All models, with the exception of GPT-3.5, improved over time in their intrarater reliability. The interrater reliability of GPT-3.5 and GPT-4, however, decreased slightly over time. These findings indicate that some models perform better than others in AES and that all models are subject to fluctuations in their performance. We discuss potential reasons for such variability and offer suggestions for prospective avenues of research. © 2024 The Authors