A machine learning based framework for code clone validation

Cited by: 8
Authors
Mostaeen, Golam [1]
Roy, Banani [1]
Roy, Chanchal K. [1]
Schneider, Kevin [1]
Svajlenko, Jeffrey [2]
Affiliations
[1] Univ Saskatchewan, Saskatoon, SK, Canada
[2] GitHub Inc, San Francisco, CA USA
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
Code clones; Validation; Machine learning; Clone management; SYSTEM;
DOI
10.1016/j.jss.2020.110686
Chinese Library Classification number
TP31 [Computer software]
Subject classification codes
081202; 0835
Abstract
A code clone is a pair of similar code fragments within or between software systems. Since code clones often negatively impact the maintainability of a software system, several code clone detection techniques and tools have been proposed and studied over the last decade. However, clone detection tools are not perfect, and their reports often contain false positives or clones that are irrelevant from a specific project-management or user perspective. To detect all possible similar source code patterns in general, the tools work at the syntax level and do not account for user-specific preferences. As a result, the clones often must be manually inspected before analysis in order to remove the false positives from consideration. This manual clone validation effort is very time-consuming and often error-prone, in particular for large-scale clone detection. In this paper, we propose a machine learning approach for automating the validation process. First, a training dataset is built by taking code clones from several clone detection tools for different subject systems and then manually validating those clones. Second, several features are extracted from those clones to train the machine learning model of the proposed approach. The trained model is then used to automatically validate clones without human inspection. Thus the proposed approach can be used to remove false positive clones from detection results, automatically evaluate the precision of any clone detector for any given set of datasets, evaluate existing clone benchmark datasets, or even build new clone benchmarks and datasets with minimum effort. In an experiment with clones detected by several clone detectors in several different software systems, we found that our approach has an accuracy of up to 87.4% when compared against manual validation by multiple expert judges. The proposed method also shows better results in several comparative studies with existing related approaches for clone classification. © 2020 Elsevier Inc. All rights reserved.
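Illustrative sketch (not from the paper): the pipeline described in the abstract, namely manually labeled clone pairs, feature extraction, model training, and automatic validation, could look roughly like the following. The two features (token-set Jaccard similarity and line-count ratio), the logistic-regression learner, the toy training pairs, and all function names are assumptions chosen for illustration; the abstract does not specify the authors' actual features or model.

```python
import math
import re

def tokens(code):
    # Split source text into identifiers, numbers, and single punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def features(frag_a, frag_b):
    # Hypothetical features: bias term, token-set Jaccard similarity, line-count ratio.
    ta, tb = set(tokens(frag_a)), set(tokens(frag_b))
    jaccard = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    la, lb = frag_a.count("\n") + 1, frag_b.count("\n") + 1
    return [1.0, jaccard, min(la, lb) / max(la, lb)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, x):
    # Model's estimated probability that the pair is a true clone.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def train(samples, labels, epochs=2000, lr=0.5):
    # Batch gradient descent on the logistic loss (a stand-in learner).
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(samples, labels):
            err = score(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad)]
    return w

def validate(w, frag_a, frag_b, threshold=0.5):
    # Automatic validation: True means "accept as a true clone".
    return score(w, features(frag_a, frag_b)) >= threshold

def estimated_precision(w, reported_pairs):
    # Share of a detector's reported pairs that the model accepts.
    if not reported_pairs:
        return 0.0
    return sum(validate(w, a, b) for a, b in reported_pairs) / len(reported_pairs)

# Toy "manually validated" training set: (fragment A, fragment B, label).
labeled_pairs = [
    ("total = total + x", "total = total + y", 1),
    ("return a * b", "return a * b", 1),
    ("if x > 0:\n    y = x", "if x > 0:\n    y = x + 1", 1),
    ("print(name)", "i += 1\nj += 1\nk += 1", 0),
    ("x = 1", "for item in items:\n    handle(item)\n    count += 1", 0),
    ("import os", "def f():\n    return 0", 0),
]

weights = train([features(a, b) for a, b, _ in labeled_pairs],
                [y for _, _, y in labeled_pairs])
predictions = [validate(weights, a, b) for a, b, _ in labeled_pairs]
precision = estimated_precision(weights, [(a, b) for a, b, _ in labeled_pairs])
print(predictions, precision)
```

In this sketch, estimated_precision treats the model's verdicts as a proxy for manual judgment, which mirrors the abstract's idea of evaluating a detector's precision without human inspection. In practice the features and learner would need to be far richer than this toy example, and the model should be evaluated on pairs it was not trained on.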
Pages: 19