SAGA: Efficient and Large-Scale Detection of Near-Miss Clones with GPU Acceleration

Cited by: 0
Authors
Li, Guanhua [1, 3]
Wu, Yijian [2, 3]
Roy, Chanchal K. [4]
Sun, Jun [5]
Peng, Xin [2, 3]
Zhan, Nanjie [2, 3]
Hu, Bin [2, 3]
Ma, Jingyi [2]
Affiliations
[1] Fudan Univ, Software Sch, Shanghai, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China
[3] Shanghai Key Lab Data Sci, Shanghai, Peoples R China
[4] Univ Saskatchewan, Saskatoon, SK, Canada
[5] Singapore Management Univ, Singapore, Singapore
Funding
National Key Research and Development Program of China; Shanghai Rising-Star Program;
Keywords
clone detection; near-miss clone; segment clone; GPU acceleration; big code; CODE; CCFINDER; SYSTEM;
DOI
10.1109/saner48275.2020.9054832
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Clone detection on large code repositories is necessary for many big code analysis tasks. The goal is to provide rich information on identical and similar code across projects. Detecting near-miss code clones on big code is challenging because it demands intensive computing and memory resources as the scale of the source code increases. In this work, we propose SAGA, an efficient suffix-array-based code clone detection tool designed with sophisticated GPU optimization. SAGA detects not only Type-1 and Type-2 clones but also the most computationally expensive Type-3 clones, and it does so across projects in large repositories. It also works at segment granularity, which is even more challenging. It detects code clones in 100 million lines of code within 11 minutes (with recall and precision comparable to state-of-the-art approaches), which is more than 10 times faster than state-of-the-art tools. It is the only tool that efficiently detects Type-3 near-miss clones at segment granularity in large code repositories (e.g., within 11 hours on 1 billion lines of code). We conduct a preliminary case study on 85,202 GitHub Java projects with 1 billion lines of code and present the distribution of clones across projects. We find about 1.23 million Type-3 clone groups, containing 28 million lines of code at arbitrary segment granularity, which are only detectable with SAGA. We believe SAGA is useful in many software engineering applications such as code provenance analysis, code completion, change impact analysis, and many more.
Pages: 272 - 283
Number of pages: 12
Related papers
50 in total
  • [21] Efficient Large-scale Approximate Nearest Neighbor Search on the GPU
    Wieschollek, Patrick
    Wang, Oliver
    Sorkine-Hornung, Alexander
    Lensch, Hendrik P. A.
    [J]. 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, : 2027 - 2035
  • [22] Drive Video Analysis for the Detection of Traffic Near-Miss Incidents
    Kataoka, Hirokatsu
    Suzuki, Teppei
    Oikawa, Shoko
    Matsui, Yasuhiro
    Satoh, Yutaka
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), 2018, : 3421 - 3428
  • [23] Evaluation of SIMMARC: An Audiovisual System for the Detection of Near-Miss Accidents
    Krebs, Florian
    Thallinger, Georg
    Neuschmied, Helmut
    Graf, Franz
    Huber, Georg
    Fallast, Kurt
    Vertal, Peter
    Kolla, Eduard
    [J]. INTELLIGENT TRANSPORT SYSTEMS, 2020, 310 : 192 - 202
  • [24] SPAPE: A semantic-preserving amorphous procedure extraction method for near-miss clones
    Bian, Yixin
    Koru, Gunes
    Su, Xiaohong
    Ma, Peijun
    [J]. JOURNAL OF SYSTEMS AND SOFTWARE, 2013, 86 (08) : 2077 - 2093
  • [25] Acceleration of large-scale CGH generation using multi-GPU cluster
    Watanabe, Shinpei
    Jackin, Boaz Jessie
    Ohkawa, Takeshi
    Ootsu, Kanemitsu
    Yokota, Takashi
    Hayasaki, Yoshio
    Yatagai, Toyohiko
    Baba, Takanobu
    [J]. 2017 FIFTH INTERNATIONAL SYMPOSIUM ON COMPUTING AND NETWORKING (CANDAR), 2017, : 589 - 593
  • [26] GPU Acceleration of Hydraulic Transient Simulations of Large-Scale Water Supply Systems
    Meng, Wanwan
    Cheng, Yongguang
    Wu, Jiayang
    Yang, Zhiyan
    Zhu, Yunxian
    Shang, Shuai
    [J]. APPLIED SCIENCES-BASEL, 2019, 9 (01):
  • [27] Efficient Large-Scale Stance Detection in Tweets
    Yan, Yilin
    Chen, Jonathan
    Shyu, Mei-Ling
    [J]. INTERNATIONAL JOURNAL OF MULTIMEDIA DATA ENGINEERING & MANAGEMENT, 2018, 9 (03): : 1 - 16
  • [28] Large-Scale Pairwise Sequence Alignments on a Large-Scale GPU Cluster
    Savran, Ibrahim
    Gao, Yang
    Bakos, Jason D.
    [J]. IEEE DESIGN & TEST, 2014, 31 (01) : 51 - 61
  • [29] Models are Code too: Near-miss Clone Detection for Simulink Models
    Alalfi, Manar H.
    Cordy, James R.
    Dean, Thomas R.
    Stephan, Matthew
    Stevenson, Andrew
    [J]. 2012 28TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE (ICSM), 2012, : 295 - 304
  • [30] 3D large-scale SPH modeling of vehicle wading with GPU acceleration
    Huashan Zhang
    Xiaoxiao Li
    Kewei Feng
    Moubin Liu
    [J]. Science China (Physics, Mechanics & Astronomy), 2023, (10) : 74 - 95