Machine learning model for chatGPT usage detection in students' answers to open-ended questions: Case of Lithuanian language

Authors
Stefanovic, Pavel [1]
Pliuskuviene, Birute [1]
Radvilaite, Urte [1]
Ramanauskaite, Simona [2]
Affiliations
[1] Vilnius Gediminas Tech Univ, Dept Informat Syst, Sauletekio Al 11, LT-10223 Vilnius, Lithuania
[2] Vilnius Gediminas Tech Univ, Dept Informat Technol, Sauletekio Al 11, LT-10223 Vilnius, Lithuania
Keywords
Text plagiarism; chatGPT; Education; Text pre-processing; Machine learning; Lithuanian language
DOI
10.1007/s10639-024-12589-z
Chinese Library Classification (CLC) number
G40 [Education]
Discipline classification codes
040101; 120403
Abstract
The public availability of large language models, such as chatGPT, brings additional possibilities and challenges to education. Education institutions have to identify when large language models are used and when text is written by the students themselves. In this paper, chatGPT usage in students' answers is investigated. The main aim of the research was to build a machine learning model that could be used in the evaluation of students' answers to open-ended questions written in the Lithuanian language. The model should determine whether the answers were originally written by students or answered with the help of chatGPT. A new dataset of student answers has been collected to train machine learning models. The dataset consists of original student answers, chatGPT answers, and paraphrased chatGPT answers. A total of more than 1000 answers have been prepared. 24 combinations of text pre-processing algorithms have been analyzed. In text pre-processing, the main focus was on various tokenization methods, such as Bag of Words and N-grams, the stemming algorithm, and the stop words list. For the analyzed dataset, these pre-processing methods were more effective than the application of multilanguage BERT for document embedding. Based on the features and properties of the dataset, the following learning algorithms have been investigated: artificial neural networks, decision trees, random forest, gradient boosting trees, k-nearest neighbours, and naive Bayes. The main results show that the highest accuracy of 87% in some cases can be obtained using gradient boosting trees, random forests, and artificial neural network algorithms. The lowest accuracy has been obtained using the k-nearest neighbours algorithm. Furthermore, the results of experimental research suggest that the usage of chatGPT in student answers can be automatically identified.
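The kind of pipeline the abstract describes (stop-word filtering, Bag of Words and N-gram tokenization, then a classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word list, the toy English answers, the labels, and the names `tokenize` and `NaiveBayes` are all hypothetical, stemming is omitted, and naive Bayes is chosen only because it is the simplest of the algorithms listed.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; the Lithuanian list used in the paper is not given here.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}


def tokenize(text, n_max=2, stop_words=STOP_WORDS):
    """Lowercase, split into words, drop stop words, and emit all 1..n_max-grams."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in stop_words]
    features = []
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            features.append(" ".join(words[i:i + n]))
    return features


class NaiveBayes:
    """Multinomial naive Bayes over bag-of-n-grams counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = tokenize(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        feats = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each feature.
            score = math.log(n_docs / total_docs)
            for f in feats:
                score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up answers (English for readability); not taken from the paper's dataset.
train_texts = [
    "i think the loop stops when the counter hits zero",
    "honestly not sure but the loop ends at zero",
    "as an overview, a loop is a control structure that repeats instructions",
    "in summary, a control structure repeats instructions until a condition holds",
]
train_labels = ["student", "student", "generated", "generated"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("a control structure that repeats instructions"))
print(model.predict("the loop ends at zero"))
```

Changing `n_max` or the stop-word set gives different pre-processing variants, which is roughly how a set of combinations such as the 24 mentioned above could be compared; the specific combinations the authors tested are not stated in the abstract.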
Pages: 23