Measuring similarity between Karel programs using character and word n-grams

Cited by: 0

Authors:

G. Sidorov

M. Ibarra Romero

I. Markov

R. Guzman-Cabrera

L. Chanona-Hernández

F. Velásquez

Affiliations:

[1] Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN)

[2] University of Guanajuato, Engineering Division

[3] Campus Irapuato-Salamanca, Instituto Politécnico Nacional

[4] School of Mechanical and Electrical Engineering (ESIME)

[5] Polytechnic University of Queretaro

Source:

Programming and Computer Software | 2017 / Vol. 43

Keywords:

machine learning; similarity; Karel programming language; character n-grams; word n-grams; SVM; LSA

DOI:

Not available

Abstract:

We present a method for measuring similarity between source code programs. We approach the task from a machine learning perspective, using character and word n-grams as features and comparing several machine learning algorithms. We also examine the contribution of latent semantic analysis (LSA) to this task. To evaluate the approach, we built a corpus of around 10,000 programs written in the Karel programming language that solve 100 different tasks. The highest classification accuracy is achieved with a Support Vector Machine (SVM) classifier, applying latent semantic analysis and using word trigrams as features.
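
As a rough illustration of the features named in the abstract, the sketch below extracts character and word n-grams from program text and compares the resulting count vectors with cosine similarity. It is not the authors' pipeline: the record describes a classification setup with SVM and LSA, whereas this example covers only n-gram extraction and a direct pairwise comparison. The Karel-style snippets, the whitespace tokenization, and all function names are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Count all overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count all overlapping word n-grams, splitting on whitespace (an assumption)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical Karel-style snippets, written for this illustration only.
prog_a = "iterate ( 3 ) { move ( ) ; } turnleft ( ) ;"
prog_b = "iterate ( 5 ) { move ( ) ; } turnleft ( ) ;"
prog_c = "while ( frontIsClear ) { pickbeeper ( ) ; }"

# Word trigrams: the feature type the abstract reports as most effective.
sim_ab = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_b, 3))
sim_ac = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_c, 3))

# Character trigrams over the same pair, for comparison.
char_sim_ab = cosine(char_ngrams(prog_a, 3), char_ngrams(prog_b, 3))

print(f"word 3-grams  a~b: {sim_ab:.3f}  a~c: {sim_ac:.3f}")
print(f"char 3-grams  a~b: {char_sim_ab:.3f}")
```

In this sketch, the two programs that differ only in the loop count score higher than the pair with different control structure. A classifier-based setup such as the one described in the abstract would use the same kind of n-gram counts as input features rather than comparing them directly.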

Pages: 47-50

Page count: 3
