Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability

Cited by: 0
Authors
Pack A. [1 ]
Barrett A. [2 ]
Escalante J. [1 ]
Affiliations
[1] Faculty of Education and Social Work, Brigham Young University-Hawaii, 55-220 Kulanui Street Bldg 5, Laie, HI 96762-1293
[2] College of Education, Florida State University, Stone Building, 114 West Call Street, Tallahassee, FL 32306-2400
Keywords
Artificial intelligence; Automatic essay scoring; Automatic writing evaluation; ChatGPT; Generative AI; Large language model
DOI
10.1016/j.caeai.2024.100234
Abstract
Advancements in generative AI, such as large language models (LLMs), may serve as a potential solution to the burdensome task of essay grading often faced by language education teachers. Yet the validity and reliability of leveraging LLMs for automatic essay scoring (AES) in language education are not well understood. To address this, we evaluated the cross-sectional and longitudinal validity and reliability of four prominent LLMs, Google's PaLM 2, Anthropic's Claude 2, and OpenAI's GPT-3.5 and GPT-4, for the AES of English language learners' writing. A total of 119 essays taken from an English language placement test were each assessed twice by each LLM, on two separate occasions, as well as by a pair of human raters. GPT-4 performed the best, demonstrating excellent intrarater reliability and good validity. All models, with the exception of GPT-3.5, improved over time in their intrarater reliability. The interrater reliability of GPT-3.5 and GPT-4, however, decreased slightly over time. These findings indicate that some models perform better than others in AES and that all models are subject to fluctuations in their performance. We discuss potential reasons for such variability and offer suggestions for prospective avenues of research. © 2024 The Authors
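The abstract centers on intrarater reliability (a model's agreement with itself across scoring occasions) and interrater reliability (its agreement with human raters). As an illustration only, the sketch below shows how such agreement statistics might be computed with quadratically weighted kappa (QWK), a metric widely used in AES research; the score arrays, variable names, and the choice of QWK are assumptions for illustration, not the study's actual data or analysis.

```python
# Minimal sketch (illustrative only, not the authors' analysis) of computing
# intrarater and interrater agreement for LLM essay scores with quadratically
# weighted kappa (QWK). All scores below are hypothetical placeholders.

from sklearn.metrics import cohen_kappa_score

# Hypothetical integer rubric scores for the same five essays.
llm_occasion_1 = [4, 5, 3, 4, 6]   # LLM scores, first scoring occasion
llm_occasion_2 = [4, 5, 4, 4, 6]   # LLM scores, second scoring occasion
human_scores   = [4, 4, 3, 5, 6]   # resolved human ratings

# Intrarater reliability: the model's agreement with itself across occasions.
intra_qwk = cohen_kappa_score(llm_occasion_1, llm_occasion_2, weights="quadratic")

# Interrater reliability: the model's agreement with human raters.
inter_qwk = cohen_kappa_score(llm_occasion_1, human_scores, weights="quadratic")

print(f"Intrarater QWK: {intra_qwk:.2f}")
print(f"Interrater QWK: {inter_qwk:.2f}")
```

The published study may well report different statistics (for example, intraclass correlations); QWK appears here only because it is a standard agreement measure in AES benchmarking.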
Related papers
Total: 50
  • [41] Beyond semantic distance: Automated scoring of divergent thinking greatly improves with large language models
    Organisciak, Peter
    Acar, Selcuk
    Dumas, Denis
    Berthiaume, Kelly
    THINKING SKILLS AND CREATIVITY, 2023, 49
  • [43] Supporting language learners in science classrooms: insights from middle-school English language learner students
    Braden, Sarah
    Wassell, Beth A.
    Scantlebury, Kathryn
    Grover, Alex
    LANGUAGE AND EDUCATION, 2016, 30 (05) : 438 - 458
  • [44] Automated essay evaluation software in English Language Arts classrooms: Effects on teacher feedback, student motivation, and writing quality
    Wilson, Joshua
    Czik, Amanda
    COMPUTERS & EDUCATION, 2016, 100 : 94 - 109
  • [45] Large-Language Models in Orthodontics: Assessing Reliability and Validity of ChatGPT in Pretreatment Patient Education
    Vassis, Stratos
    Powell, Harriet
    Petersen, Emma
    Barkmann, Asta
    Noeldeke, Beatrice
    Kristensen, Kasper D.
    Stoustrup, Peter
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (08)
  • [46] Potential impact of large language models on academic writing
    Alahdab, Fares
    BMJ EVIDENCE-BASED MEDICINE, 2024, 29 (03) : 201 - 202
  • [47] PERSUASIVE LEGAL WRITING USING LARGE LANGUAGE MODELS
    Curran, Damian
    Levy, Inbar
    Mistica, Meladel
    Hovy, Eduard
    LEGAL EDUCATION REVIEW, 2024, 34 (01)
  • [48] ReaderBench Learns Dutch: Building a Comprehensive Automated Essay Scoring System for Dutch Language
    Dascalu, Mihai
    Westera, Wim
    Ruseti, Stefan
    Trausan-Matu, Stefan
    Kurvers, Hub
    ARTIFICIAL INTELLIGENCE IN EDUCATION, AIED 2017, 2017, 10331 : 52 - 63
  • [49] Linking essay-writing tests using many-facet models and neural automated essay scoring
    Uto, Masaki
    Aramaki, Kota
    BEHAVIOR RESEARCH METHODS, 2024, 56 (08) : 8450 - 8479
  • [50] Automated Essay Evaluation for English Language Learners: A Case Study of MY Access
    Giang Thi Linh Hoang
    Kunnan, Antony John
    LANGUAGE ASSESSMENT QUARTERLY, 2016, 13 (04) : 359 - 376