Reliability of ChatGPT in automated essay scoring for dental undergraduate examinations

Cited by: 5
Authors
Quah, Bernadette [1 ,2 ]
Zheng, Lei [1 ,2 ]
Sng, Timothy Jie Han [1 ,2 ]
Yong, Chee Weng [1 ,2 ]
Islam, Intekhab [1 ,2 ]
Affiliations
[1] Natl Univ Singapore, Fac Dent, Singapore, Singapore
[2] Natl Univ Ctr Oral Hlth, Discipline Oral & Maxillofacial Surg, 9 Lower Kent Ridge Rd, Singapore, Singapore
Keywords
Artificial intelligence; Education, Dental; Academic performance; Models, Educational; Mentoring; Educational needs assessment; Medical education
DOI
10.1186/s12909-024-05881-6
Chinese Library Classification: G40 [Education]
Discipline codes: 040101; 120403
Abstract
Background: This study aimed to answer the research question: How reliable is ChatGPT in automated essay scoring (AES) for oral and maxillofacial surgery (OMS) examinations for dental undergraduate students compared with human assessors?
Methods: Sixty-nine undergraduate dental students at the National University of Singapore sat a closed-book examination comprising two essay questions. Using pre-created assessment rubrics, three assessors independently scored the essays manually, while a separate assessor performed AES using ChatGPT (GPT-4). The intraclass correlation coefficient and Cronbach's alpha were used to evaluate inter-rater agreement and reliability of the test scores across all assessors, and the mean scores from manual and automated scoring were compared for similarity and correlation.
Results: AES scores correlated strongly with all manual scorers for Question 1 (r = 0.752-0.848, p < 0.001) and moderately for Question 2 (r = 0.527-0.571, p < 0.001). Intraclass correlation coefficients of 0.794-0.858 indicated excellent inter-rater agreement, and Cronbach's alpha values of 0.881-0.932 indicated high reliability. For Question 1, mean AES scores were similar to mean manual scores (p > 0.05), with a strong correlation between the two (r = 0.829, p < 0.001). For Question 2, AES scores were significantly lower than manual scores (p < 0.001), with a moderate correlation (r = 0.599, p < 0.001).
Conclusion: This study shows the potential of ChatGPT for essay marking; however, appropriate rubric design is essential for optimal reliability. With further validation, ChatGPT could aid students in self-assessment or support large-scale automated marking.
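The abstract reports inter-rater agreement and reliability via the intraclass correlation coefficient, Cronbach's alpha, and Pearson correlations between AES and manual scores. The sketch below is not the authors' analysis code; it only illustrates how such statistics could be computed for a hypothetical (students x assessors) score matrix, assuming three manual scorers plus one AES column and a 20-mark scale. Because the abstract does not state which ICC form was used, the single-rater, absolute-agreement ICC(2,1) is shown as one common choice.

```python
# Minimal reliability sketch (hypothetical data, not the study's dataset or code).
import numpy as np
from scipy.stats import pearsonr

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_students x k_raters) score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # variance of each rater's scores
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of summed scores per student
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def icc2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater."""
    n, k = scores.shape
    grand = scores.mean()
    ss_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum()   # between-student SS
    ss_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum()   # between-rater SS
    ss_err = ((scores - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical scores: 69 students, columns = 3 manual scorers + 1 AES score (out of 20)
    true_ability = rng.normal(14, 3, size=(69, 1))
    scores = np.clip(true_ability + rng.normal(0, 1.5, size=(69, 4)), 0, 20)
    print(f"Cronbach's alpha: {cronbach_alpha(scores):.3f}")
    print(f"ICC(2,1):         {icc2_1(scores):.3f}")
    manual_mean = scores[:, :3].mean(axis=1)     # mean of the three manual scores
    r, p = pearsonr(manual_mean, scores[:, 3])   # AES vs. mean manual score
    print(f"Pearson r = {r:.3f}, p = {p:.3g}")
```

In practice, the same quantities can also be obtained from dedicated packages (e.g., an ICC routine in a statistics library), which additionally report confidence intervals and alternative ICC forms.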
Pages: 12