CodeBERT for Code Clone Detection: A Replication Study

Cited by: 4
Authors
Arshad, Saad [1]
Abid, Shamsa [2]
Shamail, Shafay [1]
Affiliations
[1] LUMS, Dept Comp Sci, SBASSE, Lahore, Pakistan
[2] Singapore Management Univ, Sch Comp & Informat Syst, Singapore, Singapore
Keywords
Code Clone Detection; Semantic Code Clones; Deep-learning; CodeBERT; BigCloneBench; SemanticCloneBench; Android
DOI
10.1109/IWSC55060.2022.00015
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Large pre-trained models have dramatically improved the state of the art on a variety of natural language processing (NLP) tasks. CodeBERT is one such pre-trained model for natural language (NL) and programming language (PL) that captures the semantics of both and produces general-purpose representations. While it has been shown to support natural language code search and code documentation generation, its effectiveness for code clone detection has not been explored in depth. In this paper, we aim to replicate and evaluate the performance of CodeBERT for code clone detection on multiple datasets with varying functionalities to understand (1) whether CodeBERT can generalize to unseen code, (2) how fine-tuning can affect CodeBERT's performance on unseen code, and (3) how CodeBERT performs in detecting various code clone types. To this end, we consider three different datasets of Java methods. We derive the first dataset from BigCloneBench. We use Java clone pairs from SemanticCloneBench to derive our second dataset, and our third dataset consists of Java methods from Android applications. Our experiments indicate that CodeBERT performs best at detecting Type-1 and Type-4 clones, with average recall of 100% and 96%, respectively. We also find that generalizability to unseen functionalities is limited: recall drops by 15% and 40% on the SemanticCloneBench and Android datasets, respectively. Furthermore, we observe that fine-tuning can improve recall by 22% and 30% on the SemanticCloneBench and Android datasets, respectively.
Pages: 39-45
Number of pages: 7