Measuring similarity between Karel programs using character and word n-grams

Authors
G. Sidorov
M. Ibarra Romero
I. Markov
R. Guzman-Cabrera
L. Chanona-Hernández
F. Velásquez
Affiliations
[1] Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN)
[2] University of Guanajuato, Engineering Division
[3] Campus Irapuato-Salamanca, Instituto Politécnico Nacional
[4] School of Mechanical and Electrical Engineering (ESIME)
[5] Polytechnic University of Queretaro
Keywords
machine learning; similarity; Karel programming language; character n-grams; word n-grams; SVM; LSA
DOI
Not available
Abstract
We present a method for measuring similarity between source code programs. We approach the task from a machine learning perspective, using character and word n-grams as features and comparing several machine learning algorithms. We also examine the contribution of latent semantic analysis (LSA). To evaluate the approach, we built a corpus of around 10,000 programs written in the Karel programming language that solve 100 different tasks. The highest classification accuracy is achieved with a Support Vector Machine (SVM) classifier, with latent semantic analysis applied and word trigrams as features.
Pages: 47-50
Number of pages: 3