CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models

Cited by: 4

Authors:

Sun, Zhensu [1]

Du, Xiaoning [3]

Song, Fu [2, 4, 5]

Li, Li [1, 4, 5]

Affiliations:

[1] Beihang Univ, Beijing, Peoples R China

[2] Chinese Acad Sci, Inst Software, State Key Lab Comp Sci, Beijing, Peoples R China

[3] Monash Univ, Melbourne, Vic, Australia

[4] Chinese Acad Sci, Beijing, Peoples R China

[5] Automot Software Innovat Center, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Neural code completion models; Watermarking; Code dataset;

DOI:

10.1145/3611643.3616297

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification code:

081202; 0835;

Abstract:

Code datasets are of immense value for training neural-network-based code completion models, and companies and organizations have made substantial investments to build and process them. Unfortunately, these datasets, whether built for proprietary or public use, face a high risk of unauthorized exploitation resulting from data leakage, license violations, and similar causes. Even worse, the "black-box" nature of neural models sets a high barrier for outsiders to audit their training datasets, which further encourages such unauthorized usage. Watermarking methods have been proposed to deter inappropriate usage of image and natural language datasets. However, due to domain specificity, they are not directly applicable to code datasets, leaving the copyright protection of this emerging and important class of data exposed to threats. To fill this gap, we propose a method, named CodeMark, to embed user-defined imperceptible watermarks into code datasets to trace their usage in training neural code completion models. CodeMark is based on adaptive semantic-preserving transformations, which preserve the exact functionality of the code data and keep the changes covert against rule-breakers. We implement CodeMark in a toolkit and conduct an extensive evaluation on code completion models. CodeMark is validated to fulfill all desired properties of practical watermarks, including harmlessness to model accuracy, verifiability, robustness, and imperceptibility.
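The abstract does not spell out which transformations are used. As a minimal sketch of what a semantic-preserving transformation can look like, the Python snippet below rewrites `range(n)` into the equivalent `range(0, n)` with the standard `ast` module (Python 3.9+ for `ast.unparse`). The rewrite rule, the class name `RangeRewriter`, and the helper `apply_transformation` are assumptions chosen for illustration; they are not taken from the CodeMark toolkit.

```python
import ast


class RangeRewriter(ast.NodeTransformer):
    """Hypothetical rule: rewrite range(n) as the equivalent range(0, n)."""

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)  # rewrite nested calls first
        is_plain_range = (
            isinstance(node.func, ast.Name)
            and node.func.id == "range"  # assumes the built-in is not shadowed
            and len(node.args) == 1
            and not isinstance(node.args[0], ast.Starred)
            and not node.keywords
        )
        if is_plain_range:
            # Make the default start value explicit; behavior is unchanged.
            node.args.insert(0, ast.Constant(value=0))
        return node


def apply_transformation(source: str) -> str:
    """Parse the source, apply the rewrite, and return the new source text."""
    tree = RangeRewriter().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


ORIGINAL = (
    "def total(n):\n"
    "    acc = 0\n"
    "    for i in range(n):\n"
    "        acc += i\n"
    "    return acc\n"
)
TRANSFORMED = apply_transformation(ORIGINAL)
print(TRANSFORMED)
```

The transformed function computes the same result as the original; only the surface form of the call differs, which is the property the abstract attributes to the watermark carrier.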


Pages: 1561-1572

Page count: 12
