Machine learning model for chatGPT usage detection in students' answers to open-ended questions: Case of Lithuanian language

Cited by: 0

Authors:

Stefanovic, Pavel [1]

Pliuskuviene, Birute [1]

Radvilaite, Urte [1]

Ramanauskaite, Simona [2]

Affiliations:

[1] Vilnius Gediminas Tech Univ, Dept Informat Syst, Sauletekio Al 11, LT-10223 Vilnius, Lithuania

[2] Vilnius Gediminas Tech Univ, Dept Informat Technol, Sauletekio Al 11, LT-10223 Vilnius, Lithuania

Source:

EDUCATION AND INFORMATION TECHNOLOGIES | 2024

Keywords:

Text plagiarism; chatGPT; Education; Text pre-processing; Machine learning; Lithuanian language;

DOI:

10.1007/s10639-024-12589-z

Chinese Library Classification (CLC) number:

G40 [Pedagogy];

Discipline classification code:

040101 ; 120403 ;

Abstract:

The public availability of large language models, such as chatGPT, brings additional possibilities and challenges to education. Educational institutions have to identify when large language models are used and when the text is written by the students themselves. In this paper, chatGPT usage in students' answers is investigated. The main aim of the research was to build a machine learning model that could be used in the evaluation of students' answers to open-ended questions written in the Lithuanian language. The model should determine whether the answers were originally written by students or answered with the help of chatGPT. A new dataset of student answers has been collected to train machine learning models. The dataset consists of original student answers, chatGPT answers, and paraphrased chatGPT answers. A total of more than 1000 answers have been prepared. Twenty-four combinations of text pre-processing algorithms have been analyzed. In text pre-processing, the main focus was on various tokenization methods, such as the Bag of Words and N-grams, the stemming algorithm, and the stop words list. For the analyzed dataset, these pre-processing methods were more effective than the application of multilingual BERT for document embedding. Based on the features/properties of the dataset, the following learning algorithms have been investigated: artificial neural networks, decision trees, random forest, gradient boosting trees, k-nearest neighbours, and naive Bayes. The main results show that the highest accuracy, 87% in some cases, can be obtained using gradient boosting trees, random forest, and artificial neural network algorithms. The lowest accuracy has been obtained using the k-nearest neighbours algorithm. Furthermore, the results of experimental research suggest that the usage of chatGPT in student answers can be automatically identified.
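To make the pre-processing and classification steps named in the abstract more concrete, the following is a minimal sketch in plain Python of Bag of Words / N-gram tokenization with a stop words list, feeding a multinomial naive Bayes classifier (one of the six algorithm families listed). It is not the authors' implementation: the stop-word list, the four toy training sentences, and the names `tokenize`, `NaiveBayes`, `TRAIN`, and `model` are all assumptions made for illustration, and stemming is omitted.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the list used in the paper is not given here.
STOP_WORDS = frozenset({"ir", "yra", "kad"})


def tokenize(text, n=1, stop_words=frozenset()):
    """Lowercase, split into words, drop stop words, and build word n-grams."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in stop_words]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = {c: Counter() for c in self.class_counts}
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for t in tokens:
                if t in self.vocab:  # skip tokens never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy data, NOT taken from the paper's dataset.
TRAIN = [
    ("duomenų bazė yra lentelių rinkinys", "student"),
    ("bazė saugo lenteles ir įrašus", "student"),
    ("apibendrinant galima teigti kad duomenų bazė yra struktūrizuota sistema", "chatgpt"),
    ("apibendrinant galima teigti kad sistema užtikrina vientisumą", "chatgpt"),
]

model = NaiveBayes().fit(
    [tokenize(text, n=1, stop_words=STOP_WORDS) for text, _ in TRAIN],
    [label for _, label in TRAIN],
)

print(model.predict(tokenize("apibendrinant galima teigti", stop_words=STOP_WORDS)))
```

Changing `n` in `tokenize` switches between unigram (Bag of Words) and N-gram features, and passing an empty `stop_words` set disables stop-word filtering; varying such switches is one plausible way to enumerate pre-processing combinations like those the abstract mentions, although the exact 24 combinations are not specified here.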


Pages: 23
