Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability

Cited by: 0
Authors
Pack A. [1 ]
Barrett A. [2 ]
Escalante J. [1 ]
Affiliations
[1] Faculty of Education and Social Work, Brigham Young University-Hawaii, 55-220 Kulanui Street Bldg 5, Laie, HI 96762-1293
[2] College of Education, Florida State University, Stone Building, 114 West Call Street, Tallahassee, FL 32306-2400
Keywords
Artificial intelligence; Automatic essay scoring; Automatic writing evaluation; ChatGPT; Generative AI; Large language model
DOI
10.1016/j.caeai.2024.100234
Abstract
Advancements in generative AI, such as large language models (LLMs), may offer a solution to the burdensome task of essay grading often faced by language education teachers. Yet the validity and reliability of leveraging LLMs for automatic essay scoring (AES) in language education are not well understood. To address this, we evaluated the cross-sectional and longitudinal validity and reliability of four prominent LLMs, Google's PaLM 2, Anthropic's Claude 2, and OpenAI's GPT-3.5 and GPT-4, for the AES of English language learners' writing. A total of 119 essays taken from an English language placement test were assessed twice by each LLM, on two separate occasions, as well as by a pair of human raters. GPT-4 performed the best, demonstrating excellent intrarater reliability and good validity. All models except GPT-3.5 improved in their intrarater reliability over time. The interrater reliability of GPT-3.5 and GPT-4, however, decreased slightly over time. These findings indicate that some models perform better than others in AES and that all models are subject to fluctuations in their performance. We discuss potential reasons for such variability and offer suggestions for prospective avenues of research. © 2024 The Authors
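As a rough illustration of the kind of agreement analysis the abstract describes, the sketch below computes intrarater agreement (one LLM's scores across two scoring occasions) and interrater agreement (LLM scores against human scores) using quadratic-weighted Cohen's kappa. The score data, rubric scale, and choice of statistic are assumptions for illustration only and are not taken from the paper.

```python
# Illustrative sketch only: hypothetical scores and an assumed agreement metric
# (quadratic-weighted Cohen's kappa); the paper's exact statistics may differ.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical integer scores on a 1-6 rubric band for 119 essays.
rng = np.random.default_rng(0)
human = rng.integers(1, 7, size=119)
llm_time1 = np.clip(human + rng.integers(-1, 2, size=119), 1, 6)
llm_time2 = np.clip(llm_time1 + rng.integers(-1, 2, size=119), 1, 6)

# Intrarater reliability: agreement of the same LLM with itself across occasions.
intrarater = cohen_kappa_score(llm_time1, llm_time2, weights="quadratic")

# Interrater reliability: agreement of the LLM's scores with human raters' scores.
interrater = cohen_kappa_score(llm_time1, human, weights="quadratic")

print(f"intrarater QWK: {intrarater:.2f}, interrater QWK: {interrater:.2f}")
```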