CodeBERT for Code Clone Detection: A Replication Study

Cited by: 4
Authors
Arshad, Saad [1]
Abid, Shamsa [2]
Shamail, Shafay [1]
Affiliations
[1] LUMS, Dept Comp Sci, SBASSE, Lahore, Pakistan
[2] Singapore Management Univ, Sch Comp & Informat Syst, Singapore, Singapore
Keywords
Code Clone Detection; Semantic Code Clones; Deep-learning; CodeBERT; BigCloneBench; SemanticCloneBench; Android
DOI
10.1109/IWSC55060.2022.00015
Chinese Library Classification (CLC) number
TP31 [Computer software]
Subject classification codes
081202; 0835
Abstract
Large pre-trained models have dramatically improved the state of the art on a variety of natural language processing (NLP) tasks. CodeBERT is one such pre-trained model for natural language (NL) and programming language (PL); it captures the semantics of both and produces general-purpose representations. While it has been shown to support natural language code search and code documentation generation, its effectiveness for code clone detection has not been explored in depth. In this paper, we aim to replicate and evaluate the performance of CodeBERT for code clone detection on multiple datasets with varying functionalities to understand (1) whether CodeBERT can generalize to unseen code, (2) how fine-tuning can affect CodeBERT's performance on unseen code, and (3) how CodeBERT performs at detecting various code clone types. To this end, we consider three different datasets of Java methods. We derive the first dataset from BigCloneBench. We use Java clone pairs from SemanticCloneBench to derive our second dataset, and our third dataset consists of Java methods from Android applications. Our experiments indicate that CodeBERT performs best at detecting Type-1 and Type-4 clones, with average recall of 100% and 96%, respectively. We also find that generalizability to unseen functionalities is limited: recall drops by 15% and 40% on the SemanticCloneBench and Android datasets, respectively. Furthermore, we observe that fine-tuning can improve recall by 22% and 30% on the SemanticCloneBench and Android datasets, respectively.
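The abstract reports recall separately for each clone type. As a minimal sketch of how such a per-type figure can be computed, the Python below tallies detected clone pairs per type. The function name `recall_by_clone_type` and the sample predictions are hypothetical illustrations, not the authors' evaluation code or data.

```python
from collections import defaultdict


def recall_by_clone_type(pairs):
    """Compute recall per clone type.

    `pairs` holds one (clone_type, predicted_is_clone) tuple for every
    pair that is a true clone according to the ground truth.
    """
    detected = defaultdict(int)
    total = defaultdict(int)
    for clone_type, predicted_is_clone in pairs:
        total[clone_type] += 1
        if predicted_is_clone:
            detected[clone_type] += 1
    # Recall = detected true clones / all true clones, per type.
    return {t: detected[t] / total[t] for t in total}


# Hypothetical classifier outputs for known clone pairs (not real results).
example = [
    ("Type-1", True), ("Type-1", True),
    ("Type-3", True), ("Type-3", False),
    ("Type-4", True), ("Type-4", True), ("Type-4", True), ("Type-4", False),
]
recalls = recall_by_clone_type(example)
```

Because only true clone pairs are counted, this measures recall alone; precision would additionally require predictions on non-clone pairs.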
Pages: 39-45
Number of pages: 7