Siamese: scalable and incremental code clone search via multiple code representations

Cited by: 37
Authors
Ragkhitwetsagul, Chaiyong [1]
Krinke, Jens [2]
Affiliations
[1] Mahidol Univ, Fac Informat & Commun Technol, 999 Phuttamonthon 4 Rd, Salaya 73170, Nakhon Pathom, Thailand
[2] UCL, Dept Comp Sci, Ctr Res Evolut Search & Testing, Gower St, London WC1E 6BT, England
Keywords
Code clone search; Code search engine; SYSTEM;
DOI
10.1007/s10664-019-09697-7
Chinese Library Classification
TP31 [Computer software];
Discipline classification codes
081202 ; 0835 ;
Abstract
This paper presents a novel code clone search technique that is accurate, incremental, and scalable to hundreds of millions of lines of code. Our technique incorporates multiple code representations (i.e., a technique to transform code into various representations to capture different types of clones), query reduction (i.e., a technique to select clone search keywords based on their uniqueness), and a customised ranking function (i.e., a technique to allow a specific clone type to be ranked on top of the search results) to improve clone search performance. We implemented the technique in a clone search tool, called Siamese, and evaluated its search accuracy and scalability on three established clone data sets. Siamese offers the highest mean average precision of 95% and 99% on two clone benchmarks compared to seven state-of-the-art clone detection tools, and reported the largest number of Type-3 clones compared to three other code search engines. Siamese is scalable and can return cloned code snippets within 8 seconds for a code corpus of 365 million lines of code. Using an index of 130,719 GitHub projects, we demonstrate that Siamese's incremental indexing capability dramatically decreases the index preparation time for large-scale data sets with multiple releases of software projects. The paper discusses the applications of Siamese to facilitate software development and research with two use cases including online code clone detection and clone search with automated license analysis.
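The query reduction idea mentioned in the abstract (selecting search keywords by their uniqueness) can be illustrated with a minimal sketch. This is not the Siamese implementation: the function names `document_frequencies` and `reduce_query`, the `max_df_ratio` threshold, and the toy corpus are all assumptions made for illustration. The sketch simply drops query tokens that occur in a large share of the indexed snippets, on the assumption that such tokens carry little discriminating power.

```python
from collections import Counter


def document_frequencies(corpus):
    """Count, for each token, how many indexed snippets contain it."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # count each token once per snippet
    return df


def reduce_query(query_tokens, df, corpus_size, max_df_ratio=0.5):
    """Keep only query tokens that are rare enough in the corpus.

    max_df_ratio is a hypothetical threshold: a token is kept when it
    occurs in at most that fraction of the indexed snippets.
    """
    kept, seen = [], set()
    for tok in query_tokens:
        if tok in seen:
            continue  # drop duplicate query tokens
        seen.add(tok)
        if df[tok] / corpus_size <= max_df_ratio:
            kept.append(tok)
    return kept


# Toy index of four tokenised snippets (hypothetical data).
corpus = [
    ["int", "i", "=", "0", ";"],
    ["int", "sum", "=", "0", ";"],
    ["return", "sum", ";"],
    ["for", "(", "int", "i", ")"],
]
df = document_frequencies(corpus)
query = ["int", "sum", "=", "0", ";", "return", "sum", ";"]
print(reduce_query(query, df, len(corpus)))
```

In this toy corpus `int` and `;` appear in three of the four snippets and are discarded, while the rarer tokens remain as search keywords.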
Pages: 2236-2284
Number of pages: 49
Related papers
50 in total
  • [31] Improving Cross-Language Code Clone Detection via Code Representation Learning and Graph Neural Networks
    Mehrotra, Nikita
    Sharma, Akash
    Jindal, Anmol
    Purandare, Rahul
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2023, 49 (11) : 4846 - 4868
  • [32] Reducing accidental clones using instant clone search in automatic code review
    Balachandran, Vipin
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION (ICSME 2020), 2020, : 781 - 783
  • [33] Clone-Hunter: Accelerated Bound Checks Elimination via Binary Code Clone Detection
    Xue, Hongfa
    Venkataramani, Guru
    Lan, Tian
    [J]. MAPL'18: PROCEEDINGS OF THE 2ND ACM SIGPLAN INTERNATIONAL WORKSHOP ON MACHINE LEARNING AND PROGRAMMING LANGUAGES, 2018, : 11 - 19
  • [34] FastDCF: A Partial Index Based Distributed and Scalable Near-Miss Code Clone Detection Approach for Very Large Code Repositories
    Yang, Liming
    Ren, Yi
    Guan, Jianbo
    Li, Bao
    Ma, Jun
    Han, Peng
    Tan, Yusong
    [J]. PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES, PDCAT 2021, 2022, 13148 : 210 - 222
  • [35] CCRep: Learning Code Change Representations via Pre-Trained Code Model and Query Back
    Liu, Zhongxin
    Tang, Zhijie
    Xia, Xin
    Yang, Xiaohu
    [J]. 2023 IEEE/ACM 45TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ICSE, 2023, : 17 - 29
  • [36] Query Expansion via Wordnet for Effective Code Search
    Lu, Meili
    Sun, Xiaobing
    Wang, Shaowei
    Lo, David
    Duan, Yucong
    [J]. 2015 22ND INTERNATIONAL CONFERENCE ON SOFTWARE ANALYSIS, EVOLUTION, AND REENGINEERING (SANER), 2015, : 545 - 549
  • [37] Incremental Annotate-Generalize-Search Framework for Interactive Source Code Comprehension
    Nakayama, Ken
    Tano, Shun'ichi
    Hashiyama, Tomonori
    Sakai, Eko
    [J]. 2017 IEEE 41ST ANNUAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE (COMPSAC), VOL 1, 2017, : 311 - 316
  • [38] DroidCC: A Scalable Clone Detection Approach for Android Applications to Detect Similarity at Source Code Level
    Akram, Junaid
    Shi, Zhendong
    Mumtaz, Majid
    Ping, Luo
    [J]. 2018 IEEE 42ND ANNUAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE (COMPSAC), VOL 1, 2018, : 100 - 105
  • [39] Hot Clones: Combining Search-Driven Development, Clone Management, and Code Provenance
    Schwarz, Niko
    [J]. 2012 34TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE), 2012, : 1628 - 1629
  • [40] Semantic Code Clone Detection Via Event Embedding Tree and GAT Network
    Li, Bingzhuo
    Ye, Chunyang
    Guan, Shouyang
    Zhou, Hui
    [J]. 2020 IEEE 20TH INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY, AND SECURITY (QRS 2020), 2020, : 382 - 393