Automated Essay Scoring and the Deep Learning Black Box: How Are Rubric Scores Determined?

Cited by: 26
Authors
Kumar, Vivekanandan S. [1]
Boulanger, David [1]
Affiliations
[1] Athabasca Univ, Fac Sci & Technol, Edmonton, AB, Canada
Keywords
Automated essay scoring; Deep learning; Neural network; Natural language processing; Feature importance; Rubrics; STATE-OF-THE-ART; LEXICAL DIVERSITY
DOI
10.1007/s40593-020-00211-5
Chinese Library Classification
TP39 [Computer applications]
Discipline codes
081203; 0835
Abstract
This article investigates the feasibility of using automated scoring methods to evaluate the quality of student-written essays. In 2012, Kaggle hosted the Automated Student Assessment Prize contest to find effective solutions to automated testing and grading. This article: a) analyzes the contest datasets - which contained hand-graded essays - to measure their suitability for developing competent automated grading tools; b) evaluates the potential for deep learning in automated essay scoring (AES) to produce sophisticated testing and grading algorithms; c) advocates for thorough and transparent performance reports in AES research, which would facilitate fairer comparisons among AES systems and permit study replication; d) uses both deep neural networks and state-of-the-art NLP tools to predict finer-grained rubric scores, to illustrate how rubric scores are determined from a linguistic perspective, and to uncover important features of an effective rubric scoring model. The findings first highlight the level of agreement between the two human raters on each rubric in the investigated essay dataset: 0.60 on average, as measured by the quadratic weighted kappa (QWK). Only one related study in the literature has also predicted rubric scores with models trained on the same dataset; at best, its predictive models reached an average agreement level (QWK) of 0.53 with the human raters, below the level of agreement among the human raters themselves. In contrast, this research reports an average agreement level per rubric of 0.72 (QWK) with the two human raters' resolved scores, well beyond the agreement level between the two human raters.
Further, the AES system proposed in this article predicts holistic essay scores from its predicted rubric scores and achieves a QWK of 0.78, a competitive performance according to recent literature, where cutting-edge AES tools reach agreement levels between 0.77 and 0.81, computed with the same procedure as in this article. This study's AES system goes one step further toward interpretability and the provision of high-level explanations to justify the predicted holistic and rubric scores. It contends that predicting rubric scores is essential to automated essay scoring, because doing so reveals the reasoning behind AIED-based AES systems. Will building AIED accountability improve the trustworthiness of the formative feedback generated by AES? Will AIED-empowered AES systems thoroughly mimic, or even outperform, a competent human rater? Will such machine-grading systems be subjected to verification by human raters, thus paving the way for a human-in-the-loop assessment mechanism? Will trust in new generations of AES systems improve with the addition of models that explain the inner workings of a deep learning black box? This study seeks to expand these horizons of AES to make the technique practical, explainable, and trustworthy.
Pages: 538-584
Number of pages: 47
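All of the abstract's headline results (0.53, 0.60, 0.72, 0.78) are quadratic weighted kappa (QWK) values, the standard agreement metric for the ASAP essay dataset. For readers unfamiliar with it, the sketch below implements the textbook QWK definition from scratch; the function name and the integer score encoding are illustrative, not taken from the article.

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two integer score vectors on [min_rating, max_rating]."""
    rater_a = np.asarray(rater_a)
    rater_b = np.asarray(rater_b)
    n = max_rating - min_rating + 1

    # Observed co-occurrence matrix: O[i, j] counts essays scored i by
    # rater A and j by rater B (shifted so scores start at index 0).
    O = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        O[a - min_rating, b - min_rating] += 1

    # Quadratic disagreement weights: 0 on the diagonal, growing with
    # the squared distance between the two scores.
    i, j = np.indices((n, n))
    W = (i - j) ** 2 / (n - 1) ** 2

    # Expected matrix under chance agreement: outer product of the two
    # raters' marginal score histograms, normalized to the same total.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()

    return 1.0 - (W * O).sum() / (W * E).sum()
```

Perfect agreement yields 1.0, chance-level agreement yields 0.0, and systematic disagreement goes negative; scikit-learn's `cohen_kappa_score(..., weights="quadratic")` computes the same quantity.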