SAGA: Efficient and Large-Scale Detection of Near-Miss Clones with GPU Acceleration

Authors
Li, Guanhua [1, 3]
Wu, Yijian [2, 3]
Roy, Chanchal K. [4]
Sun, Jun [5]
Peng, Xin [2, 3]
Zhan, Nanjie [2, 3]
Hu, Bin [2, 3]
Ma, Jingyi [2]
Affiliations
[1] Fudan Univ, Software Sch, Shanghai, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[3] Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[4] Univ Saskatchewan, Saskatoon, SK, Canada
[5] Singapore Management Univ, Singapore, Singapore
Funding
National Key Research and Development Program of China; Shanghai Rising-Star Program
Keywords
clone detection; near-miss clone; segment clone; GPU acceleration; big code; CODE; CCFINDER; SYSTEM;
DOI
10.1109/saner48275.2020.9054832
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Clone detection on large code repositories is necessary for many big code analysis tasks. The goal is to provide rich information on identical and similar code across projects. Detecting near-miss code clones on big code is challenging because it requires intensive computing and memory resources as the scale of the source code increases. In this work, we propose SAGA, an efficient suffix-array based code clone detection tool designed with sophisticated GPU optimization. SAGA detects not only Type-1 and Type-2 clones but also the most computationally expensive Type-3 clones, and it does so across projects in large repositories. It also works at segment granularity, which is even more challenging. It detects code clones in 100 million lines of code within 11 minutes (with recall and precision comparable to state-of-the-art approaches), which is more than 10 times faster than state-of-the-art tools. It is the only tool that efficiently detects Type-3 near-miss clones at segment granularity in large code repositories (e.g., within 11 hours on 1 billion lines of code). We conduct a preliminary case study on 85,202 GitHub Java projects with 1 billion lines of code and present the distribution of clones across projects. We find about 1.23 million Type-3 clone groups, containing 28 million lines of code at arbitrary segment granularity, which are detectable only with SAGA. We believe SAGA is useful in many software engineering applications such as code provenance analysis, code completion, and change impact analysis.
Pages: 272-283
Page count: 12