Measuring similarity between Karel programs using character and word n-grams

Cited by: 0
Authors
G. Sidorov
M. Ibarra Romero
I. Markov
R. Guzman-Cabrera
L. Chanona-Hernández
F. Velásquez
Affiliations
[1] Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN)
[2] University of Guanajuato, Engineering Division
[3] Campus Irapuato-Salamanca, Instituto Politécnico Nacional
[4] School of Mechanical and Electrical Engineering (ESIME)
[5] Polytechnic University of Queretaro
Keywords
machine learning; similarity; Karel programming language; character n-grams; word n-grams; SVM; LSA
DOI
Not available
Abstract
We present a method for measuring similarity between source code programs. We approach this task from a machine learning perspective, using character and word n-grams as features and comparing several machine learning algorithms. We also examine how much latent semantic analysis (LSA) contributes to this task. To evaluate the approach, we built a corpus of around 10,000 programs written in the Karel programming language that solve 100 different tasks. The results show that the highest classification accuracy is achieved with a Support Vector Machine (SVM) classifier, with latent semantic analysis applied and word trigrams selected as features.
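The abstract names character and word n-grams as the features. The sketch below only illustrates how such features can be extracted and compared; it is not the authors' pipeline. It uses plain cosine similarity over n-gram counts in place of the SVM and LSA steps, and the function names and the Karel-like snippets are hypothetical, invented for this example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n):
    """Count all overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count all overlapping word n-grams, splitting on whitespace."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty program shares nothing with any other
    return dot / (norm_a * norm_b)


# Hypothetical Karel-like snippets (assumed, not taken from the corpus).
prog_a = "move ( ) ; move ( ) ; turnleft ( ) ; move ( ) ;"
prog_b = "move ( ) ; turnleft ( ) ; move ( ) ;"
prog_c = "while ( frontIsClear ) { pickbeeper ( ) ; }"

# Word trigrams, the feature type the abstract reports as most accurate.
sim_ab = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_b, 3))
sim_ac = cosine(word_ngrams(prog_a, 3), word_ngrams(prog_c, 3))
print(sim_ab > sim_ac)  # the two move/turnleft programs should be closer
```

In a setup closer to the one described, these counts would presumably form the rows of a document-feature matrix that is reduced with LSA and passed to a classifier, but those details are not given in this record.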
Pages: 47 - 50
Number of pages: 3