Measuring similarity between Karel programs using character and word n-grams

Cited by: 0

Authors:

G. Sidorov

M. Ibarra Romero

I. Markov

R. Guzman-Cabrera

L. Chanona-Hernández

F. Velásquez

Affiliations:

[1] Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN)

[2] University of Guanajuato, Engineering Division

[3] Campus Irapuato-Salamanca, Instituto Politécnico Nacional

[4] School of Mechanical and Electrical Engineering (ESIME)

[5] Polytechnic University of Queretaro

Source:

Programming and Computer Software | 2017 / Vol. 43

Keywords:

machine learning; similarity; Karel programming language; character n-grams; word n-grams; SVM; LSA

DOI:

Not available

Abstract:

We present a method for measuring similarity between source codes. We approach this task from the machine learning perspective, using character and word n-grams as features and examining different machine learning algorithms. Furthermore, we explore the contribution of latent semantic analysis to this task. We developed a corpus in order to evaluate the proposed approach. The corpus consists of around 10,000 source codes written in the Karel programming language to solve 100 different tasks. The results show that the highest classification accuracy is achieved when using a Support Vector Machines classifier, applying latent semantic analysis, and selecting word trigrams as features.
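
The abstract names character and word n-grams as the features. A minimal sketch of how such features could be extracted and compared is given below. It is an illustration only, not the authors' pipeline: whitespace tokenization, raw counts, and cosine similarity are assumptions made here, the helper names (`char_ngrams`, `word_ngrams`, `cosine`) are hypothetical, and the toy snippets are not taken from the paper's corpus. The classifier and latent semantic analysis steps mentioned in the abstract are not shown.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Return a Counter of all overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Return a Counter of overlapping word n-grams (whitespace tokens)."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy Karel-like snippets (hypothetical, pre-tokenized with spaces).
prog_a = "move ( ) ; turnleft ( ) ; move ( ) ; pickbeeper ( ) ;"
prog_b = "move ( ) ; turnleft ( ) ; move ( ) ; putbeeper ( ) ;"
prog_c = "while ( frontIsClear ) { move ( ) ; }"

# Word trigrams: prog_a and prog_b differ in one token, prog_c differs more.
sim_ab = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_b, 3))
sim_ac = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_c, 3))
```

In this sketch `sim_ab` comes out higher than `sim_ac`, because the first two snippets share most of their word trigrams. In a classification setting such as the one the abstract describes, count vectors like these would typically serve as classifier input rather than being compared pairwise.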

Citation

Pages: 47-50

Page count: 3

50 records in total

[1] Measuring similarity between Karel programs using character and word n-grams
Sidorov, G.
Ibarra Romero, M.
Markov, I.
Guzman-Cabrera, R.
Chanona-Hernandez, L.
Velasquez, F.
[J]. PROGRAMMING AND COMPUTER SOFTWARE, 2017, 43 (01) : 47 - 50
[2] Which Granularity to Bootstrap a Multilingual Method of Document Alignment: Character N-grams or Word N-grams?
Lecluze, Charlotte
Rigouste, Lois
Giguet, Emmanuel
Lucas, Nadine
[J]. CORPUS RESOURCES FOR DESCRIPTIVE AND APPLIED STUDIES. CURRENT CHALLENGES AND FUTURE DIRECTIONS: SELECTED PAPERS FROM THE 5TH INTERNATIONAL CONFERENCE ON CORPUS LINGUISTICS (CILC2013), 2013, 95 : 473 - 481
[3] SPEECH RECOGNITION USING FUNCTION-WORD N-GRAMS AND CONTENT-WORD N-GRAMS
ISOTANI, R
MATSUNAGA, S
SAGAYAMA, S
[J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 1995, E78D (06) : 692 - 697
[4] Automatic word spacing using probabilistic models based on character n-grams
Lee, Do-Gil
Rim, Hae-Chang
Yook, Dongsuk
[J]. IEEE INTELLIGENT SYSTEMS, 2007, 22 (01) : 28 - 35
[5] Combining Word and Character N-grams for Detecting Deceptive Opinions
Siagian, Al Hafiz Akbar Maulana
Aritsugi, Masayoshi
[J]. 2017 IEEE 41ST ANNUAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE (COMPSAC), VOL 1, 2017, : 828 - 833
[6] Spam detection using character N-grams
Kanaris, Ioannis
Kanaris, Konstantinos
Stamatatos, Efstathios
[J]. ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 3955 : 95 - 104
[7] IDF for Word N-grams
Shirakawa, Masumi
Hara, Takahiro
Nishio, Shojiro
[J]. ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2017, 36 (01)
[8] Comparing word, character, and phoneme n-grams for subjective utterance recognition
Wilson, Theresa
Raaijmakers, Stephan
[J]. INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 1614 - +
[9] Authorship Attribution in Portuguese Using Character N-grams
Markov, Ilia
Baptista, Jorge
Pichardo-Lagunas, Obdulia
[J]. ACTA POLYTECHNICA HUNGARICA, 2017, 14 (03) : 59 - 78
[10] The subjective frequency of word n-grams
Shaoul, Cyrus
Westbury, Chris F.
Baayen, R. Harald
[J]. PSIHOLOGIJA, 2013, 46 (04) : 497 - 537
