Siamese: scalable and incremental code clone search via multiple code representations

Cited by: 37
Authors
Ragkhitwetsagul, Chaiyong [1]
Krinke, Jens [2]
Affiliations
[1] Mahidol Univ, Fac Informat & Commun Technol, 999 Phuttamonthon 4 Rd, Salaya 73170, Nakhon Pathom, Thailand
[2] UCL, Dept Comp Sci, Ctr Res Evolut Search & Testing, Gower St, London WC1E 6BT, England
Keywords
Code clone search; Code search engine; SYSTEM
DOI
10.1007/s10664-019-09697-7
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
This paper presents a novel code clone search technique that is accurate, incremental, and scalable to hundreds of millions of lines of code. Our technique incorporates multiple code representations (i.e., a technique to transform code into various representations to capture different types of clones), query reduction (i.e., a technique to select clone search keywords based on their uniqueness), and a customised ranking function (i.e., a technique to allow a specific clone type to be ranked on top of the search results) to improve clone search performance. We implemented the technique in a clone search tool, called Siamese, and evaluated its search accuracy and scalability on three established clone data sets. Siamese offers the highest mean average precision of 95% and 99% on two clone benchmarks compared to seven state-of-the-art clone detection tools, and reported the largest number of Type-3 clones compared to three other code search engines. Siamese is scalable and can return cloned code snippets within 8 seconds for a code corpus of 365 million lines of code. Using an index of 130,719 GitHub projects, we demonstrate that Siamese's incremental indexing capability dramatically decreases the index preparation time for large-scale data sets with multiple releases of software projects. The paper discusses the applications of Siamese to facilitate software development and research with two use cases: online code clone detection and clone search with automated license analysis.
Pages: 2236-2284
Number of pages: 49
Published in
Empirical Software Engineering, 2019, 24: 2236-2284