Measuring similarity between Karel programs using character and word n-grams

Authors
G. Sidorov
M. Ibarra Romero
I. Markov
R. Guzman-Cabrera
L. Chanona-Hernández
F. Velásquez
Affiliations
[1] Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN)
[2] University of Guanajuato, Engineering Division
[3] Campus Irapuato-Salamanca, Instituto Politécnico Nacional
[4] School of Mechanical and Electrical Engineering (ESIME)
[5] Polytechnic University of Queretaro
Source
Programming and Computer Software, 2017, 43 (01)
Keywords
machine learning; similarity; Karel programming language; character n-grams; word n-grams; SVM; LSA
DOI
Not available
Abstract
We present a method for measuring similarity between source code programs. We approach the task from a machine learning perspective, using character and word n-grams as features and comparing several machine learning algorithms. We also explore the contribution of latent semantic analysis (LSA) to this task. To evaluate the approach, we developed a corpus of around 10,000 programs written in the Karel programming language that solve 100 different tasks. The highest classification accuracy is achieved with a Support Vector Machine (SVM) classifier, with latent semantic analysis applied and word trigrams selected as features.
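The n-gram features named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not state how programs are tokenized, so whitespace tokenization is assumed here, and plain cosine similarity over raw n-gram counts stands in for the SVM and LSA steps. The two Karel-like snippets and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def word_ngrams(source, n):
    """Return word n-grams as tuples (assumes whitespace-separated tokens)."""
    tokens = source.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(source, n):
    """Return character n-grams as substrings of length n."""
    return [source[i:i + n] for i in range(len(source) - n + 1)]


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical Karel-like snippets, invented for illustration only.
prog_a = "move ( ) ; move ( ) ; turnleft ( ) ;"
prog_b = "move ( ) ; move ( ) ; pickbeeper ( ) ;"

# Word trigrams are the feature type the abstract reports as best.
trigrams_a = Counter(word_ngrams(prog_a, 3))
trigrams_b = Counter(word_ngrams(prog_b, 3))
similarity = cosine(trigrams_a, trigrams_b)
```

Here `similarity` falls strictly between 0 and 1 because the two snippets share some but not all word trigrams. Passing the output of `char_ngrams` to `Counter` instead gives the character-level variant of the same comparison.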
Pages: 47-50
Number of pages: 3
Related papers
  • [1] Measuring similarity between Karel programs using character and word n-grams
    Sidorov, G.
    Ibarra Romero, M.
    Markov, I.
    Guzman-Cabrera, R.
    Chanona-Hernandez, L.
    Velasquez, F.
    [J]. PROGRAMMING AND COMPUTER SOFTWARE, 2017, 43 (01) : 47 - 50
  • [2] Which Granularity to Bootstrap a Multilingual Method of Document Alignment: Character N-grams or Word N-grams?
    Lecluze, Charlotte
    Rigouste, Lois
    Giguet, Emmanuel
    Lucas, Nadine
    [J]. CORPUS RESOURCES FOR DESCRIPTIVE AND APPLIED STUDIES. CURRENT CHALLENGES AND FUTURE DIRECTIONS: SELECTED PAPERS FROM THE 5TH INTERNATIONAL CONFERENCE ON CORPUS LINGUISTICS (CILC2013), 2013, 95 : 473 - 481
  • [3] SPEECH RECOGNITION USING FUNCTION-WORD N-GRAMS AND CONTENT-WORD N-GRAMS
    ISOTANI, R
    MATSUNAGA, S
    SAGAYAMA, S
    [J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 1995, E78D (06) : 692 - 697
  • [4] Automatic word spacing using probabilistic models based on character n-grams
    Lee, Do-Gil
    Rim, Hae-Chang
    Yook, Dongsuk
    [J]. IEEE INTELLIGENT SYSTEMS, 2007, 22 (01) : 28 - 35
  • [5] Combining Word and Character N-grams for Detecting Deceptive Opinions
    Siagian, Al Hafiz Akbar Maulana
    Aritsugi, Masayoshi
    [J]. 2017 IEEE 41ST ANNUAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE (COMPSAC), VOL 1, 2017, : 828 - 833
  • [6] Spam detection using character N-grams
    Kanaris, Ioannis
    Kanaris, Konstantinos
    Stamatatos, Efstathios
    [J]. ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 3955 : 95 - 104
  • [7] IDF for Word N-grams
    Shirakawa, Masumi
    Hara, Takahiro
    Nishio, Shojiro
    [J]. ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2017, 36 (01)
  • [8] Comparing word, character, and phoneme n-grams for subjective utterance recognition
    Wilson, Theresa
    Raaijmakers, Stephan
    [J]. INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 1614 - +
  • [9] Authorship Attribution in Portuguese Using Character N-grams
    Markov, Ilia
    Baptista, Jorge
    Pichardo-Lagunas, Obdulia
    [J]. ACTA POLYTECHNICA HUNGARICA, 2017, 14 (03) : 59 - 78
  • [10] The subjective frequency of word n-grams
    Shaoul, Cyrus
    Westbury, Chris F.
    Baayen, R. Harald
    [J]. PSIHOLOGIJA, 2013, 46 (04) : 497 - 537