A parallel and efficient approach to large scale clone detection

Cited by: 14
Authors
Sajnani, Hitesh [1]
Saini, Vaibhav [1]
Lopes, Cristina [1]
Affiliations
[1] Univ Calif Irvine, Donald Bren Sch Informat & Comp Sci, Irvine, CA 92697 USA
Funding
National Science Foundation (USA);
Keywords
clone detection; large scale; parallel; mapreduce; index-based; SYSTEM; CODE;
DOI
10.1002/smr.1707
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
We propose a new token-based approach for large-scale code clone detection, which is based on a filtering heuristic that reduces the number of token comparisons when two code blocks are compared. We also present a MapReduce-based parallel algorithm that uses the filtering heuristic and scales to thousands of projects. The filtering heuristic is generic and can also be used in conjunction with other token-based approaches. In that context, we demonstrate how it can increase the retrieval speed and decrease the memory usage of index-based approaches. In our experiments on 36 open source Java projects, we found that: (i) filtering reduces token comparisons by a factor of 10, thus increasing the speed of clone detection by a factor of 1.5; (ii) the speed-up and scale-up of the parallel approach using filtering is near-linear on a cluster of 2-32 nodes for 150-2800 projects; and (iii) filtering decreases the memory usage of the index-based approach by half and the search time by a factor of 5. Copyright (C) 2015 John Wiley & Sons, Ltd.
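The abstract does not spell out the filtering heuristic, so the following is only an illustrative sketch of one way such a filter can work, assuming a prefix-style filter over token bags. The function names, the overlap-based clone definition, and the rarest-first token ordering are all assumptions made for this example and are not taken from the record.

```python
import math
from collections import Counter


def global_order(blocks):
    """Assumed ordering: rank tokens by corpus frequency, rarest first."""
    return Counter(tok for block in blocks for tok in block)


def overlap(a, b):
    """Multiset overlap: number of tokens two token bags have in common."""
    return sum((Counter(a) & Counter(b)).values())


def is_clone_naive(a, b, theta):
    """Assumed clone definition: overlap >= ceil(theta * max(|a|, |b|))."""
    return overlap(a, b) >= math.ceil(theta * max(len(a), len(b)))


def prefix(tokens, theta, order):
    """Sort tokens by the global order and keep the first
    |B| - ceil(theta * |B|) + 1 of them."""
    ranked = sorted(tokens, key=lambda t: (order[t], t))
    keep = len(tokens) - math.ceil(theta * len(tokens)) + 1
    return ranked[:keep]


def is_clone_filtered(a, b, theta, order):
    """Reject a pair early when the two prefixes share no token;
    only surviving pairs pay for the full token comparison."""
    if not set(prefix(a, theta, order)) & set(prefix(b, theta, order)):
        return False
    return is_clone_naive(a, b, theta)
```

Under these assumptions the filter never discards a true clone pair: if two blocks meet the overlap threshold, their prefixes must share at least one token, so `is_clone_filtered` and `is_clone_naive` agree while most dissimilar pairs are rejected after comparing only a few tokens.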
Pages: 402 - 429
Number of pages: 28
Related papers
50 in total
  • [1] A Parallel and Efficient Approach to Large Scale Clone Detection
    Sajnani, Hitesh
    Lopes, Cristina
    2013 7TH INTERNATIONAL WORKSHOP ON SOFTWARE CLONES (IWSC), 2013, : 46 - 52
  • [2] An Efficient Parallel Approach of Parsing and Indexing for Large-scale XML Datasets
    Song, Kunfang
    Lu, Hongwei
    Qin, Xiao
    2016 IEEE 22ND INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2016, : 184 - 191
  • [3] An Efficient New Multi-Language Clone Detection Approach from Large Source Code
    Rehman, Saif Ur
    Khan, Kamran
    Fong, Simon
    Biuk-Aghai, Robert
    PROCEEDINGS 2012 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2012, : 937 - 940
  • [4] Fast and Flexible Large-Scale Clone Detection with CloneWorks
    Svajlenko, Jeffrey
    Roy, Chanchal K.
    PROCEEDINGS OF THE 2017 IEEE/ACM 39TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING COMPANION (ICSE-C 2017), 2017, : 27 - 30
  • [5] An efficient parallel clustering algorithm for large scale database
    School of Electronic Information, Wuhan University, Wuhan, Hubei, China
    Unknown
    Unknown
    J. Softw., 2009, 10 (1119-1126):
  • [6] An Efficient Parallel Approach for Identifying Protein Families in Large-scale Metagenomic Data Sets
    Wu, Changjun
    Kalyanaraman, Ananth
    INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS, 2008, : 362 - 371
  • [7] Efficient Community Detection in Large Scale Networks
    Vieira, Vinicius da F.
    Xavier, Carolina R.
    Evsukoff, Alexandre G.
    2013 1ST BRICS COUNTRIES CONGRESS ON COMPUTATIONAL INTELLIGENCE AND 11TH BRAZILIAN CONGRESS ON COMPUTATIONAL INTELLIGENCE (BRICS-CCI & CBIC), 2013, : 669 - 674
  • [8] An efficient approach for large scale graph partitioning
    Renzo Zamprogno
    André R. S. Amaral
    Journal of Combinatorial Optimization, 2007, 13
  • [9] An efficient approach for large scale graph partitioning
    Loureiro, Renzo Z.
    Amaral, Andre R. S.
    JOURNAL OF COMBINATORIAL OPTIMIZATION, 2007, 13 (04) : 289 - 320
  • [10] Efficient parallel simulation of large-scale PCS networks
    Boukerche, A
    Das, SK
    Fabbri, A
    Yildiz, O
    TRANSACTIONS OF THE SOCIETY FOR COMPUTER SIMULATION INTERNATIONAL, 1999, 16 (03) : 113 - 125