Siamese: scalable and incremental code clone search via multiple code representations

Cited by: 37
Authors
Ragkhitwetsagul, Chaiyong [1]
Krinke, Jens [2]
Affiliations
[1] Mahidol Univ, Fac Informat & Commun Technol, 999 Phuttamonthon 4 Rd, Salaya 73170, Nakhon Pathom, Thailand
[2] UCL, Dept Comp Sci, Ctr Res Evolut Search & Testing, Gower St, London WC1E 6BT, England
Keywords
Code clone search; Code search engine; SYSTEM;
DOI
10.1007/s10664-019-09697-7
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
This paper presents a novel code clone search technique that is accurate, incremental, and scalable to hundreds of millions of lines of code. Our technique incorporates multiple code representations (i.e., a technique to transform code into various representations to capture different types of clones), query reduction (i.e., a technique to select clone search keywords based on their uniqueness), and a customised ranking function (i.e., a technique to allow a specific clone type to be ranked on top of the search results) to improve clone search performance. We implemented the technique in a clone search tool, called Siamese, and evaluated its search accuracy and scalability on three established clone data sets. Siamese offers the highest mean average precision of 95% and 99% on two clone benchmarks compared to seven state-of-the-art clone detection tools, and reported the largest number of Type-3 clones compared to three other code search engines. Siamese is scalable and can return cloned code snippets within 8 seconds for a code corpus of 365 million lines of code. Using an index of 130,719 GitHub projects, we demonstrate that Siamese's incremental indexing capability dramatically decreases the index preparation time for large-scale data sets with multiple releases of software projects. The paper discusses the applications of Siamese to facilitate software development and research with two use cases: online code clone detection and clone search with automated license analysis.
Pages: 2236-2284
Page count: 49