Measuring similarity between Karel programs using character and word n-grams

Cited by: 0

Authors:

G. Sidorov

M. Ibarra Romero

I. Markov

R. Guzman-Cabrera

L. Chanona-Hernández

F. Velásquez

Affiliations:

[1] Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN)

[2] University of Guanajuato, Engineering Division

[3] Campus Irapuato-Salamanca, Instituto Politécnico Nacional

[4] School of Mechanical and Electrical Engineering (ESIME)

[5] Polytechnic University of Queretaro

Source:

Programming and Computer Software | 2017 / Vol. 43

Keywords:

machine learning; similarity; Karel programming language; character n-grams; word n-grams; SVM; LSA

DOI:

Not available


Abstract:

We present a method for measuring similarity between source code programs. We approach this task from a machine learning perspective, using character and word n-grams as features and examining different machine learning algorithms. Furthermore, we explore the contribution of latent semantic analysis (LSA) to this task. We developed a corpus to evaluate the proposed approach. The corpus consists of around 10,000 programs written in the Karel programming language to solve 100 different tasks. The results show that the highest classification accuracy is achieved by a Support Vector Machines (SVM) classifier with latent semantic analysis applied and word trigrams as features.


Pages: 47-50

Page count: 3
