CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models

Cited by: 4
Authors
Sun, Zhensu [1]
Du, Xiaoning [3]
Song, Fu [2, 4, 5]
Li, Li [1, 4, 5]
Affiliations
[1] Beihang Univ, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Software, State Key Lab Comp Sci, Beijing, Peoples R China
[3] Monash Univ, Melbourne, Vic, Australia
[4] Chinese Acad Sci, Beijing, Peoples R China
[5] Automot Software Innovat Center, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Neural code completion models; Watermarking; Code dataset;
DOI
10.1145/3611643.3616297
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Code datasets are of immense value for training neural-network-based code completion models, and companies and organizations have made substantial investments to establish and process them. Unfortunately, these datasets, whether built for proprietary or public use, face a high risk of unauthorized exploitation resulting from data leakage, license violations, and similar causes. Even worse, the "black-box" nature of neural models makes it hard for outside parties to audit their training datasets, which further enables such unauthorized use. Watermarking methods have been proposed to guard against inappropriate use of image and natural language datasets. However, due to domain specificity, they are not directly applicable to code datasets, leaving the copyright protection of this emerging and important class of data exposed to threats. To fill this gap, we propose a method, named CodeMark, that embeds user-defined imperceptible watermarks into code datasets to trace their usage in training neural code completion models. CodeMark is based on adaptive semantic-preserving transformations, which preserve the exact functionality of the code data and keep the changes hidden from rule-breakers. We implement CodeMark in a toolkit and conduct an extensive evaluation on code completion models. CodeMark is validated to fulfill all desired properties of practical watermarks, including harmlessness to model accuracy, verifiability, robustness, and imperceptibility.
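The abstract does not list the transformations CodeMark applies, so the following is only a minimal sketch of what a semantic-preserving transformation can look like, not the authors' implementation. It assumes one hypothetical rewrite, `range(x)` to the equivalent `range(0, x)`, and the helper names `RangeRewriter`, `embed_mark`, and `run` are invented for illustration.

```python
import ast


class RangeRewriter(ast.NodeTransformer):
    """Rewrite range(x) as the equivalent range(0, x).

    Hypothetical rule chosen for illustration; it assumes the name
    `range` still refers to the builtin in the code being rewritten.
    """

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        is_plain_range = (
            isinstance(node.func, ast.Name)
            and node.func.id == "range"
            and len(node.args) == 1
            and not node.keywords
            and not isinstance(node.args[0], ast.Starred)
        )
        if is_plain_range:
            # Insert an explicit start of 0; the iteration is unchanged.
            node.args = [ast.Constant(value=0), node.args[0]]
        return node


def embed_mark(source: str) -> str:
    """Return `source` with every single-argument range call rewritten."""
    tree = RangeRewriter().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


def run(source: str) -> dict:
    """Execute `source` and return the namespace it populated."""
    namespace = {}
    exec(source, namespace)
    return namespace


original = "total = sum(i * i for i in range(5))"
marked = embed_mark(original)

# The marked code computes the same value as the original.
assert run(marked)["total"] == run(original)["total"]
```

The rewritten code differs from the original only in surface form, which is the property the abstract relies on for keeping a watermark both harmless and hard to spot.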
Pages: 1561-1572
Number of pages: 12