CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models

Cited by: 4
Authors
Sun, Zhensu [1 ]
Du, Xiaoning [3 ]
Song, Fu [2 ,4 ,5 ]
Li, Li [1 ,4 ,5 ]
Affiliations
[1] Beihang Univ, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Software, State Key Lab Comp Sci, Beijing, Peoples R China
[3] Monash Univ, Melbourne, Vic, Australia
[4] Chinese Acad Sci, Beijing, Peoples R China
[5] Automot Software Innovat Center, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Neural code completion models; Watermarking; Code dataset;
DOI
10.1145/3611643.3616297
CLC Classification Number
TP31 [Computer Software];
Subject Classification Code
081202; 0835;
Abstract
Code datasets are of immense value for training neural-network-based code completion models, and companies or organizations have made substantial investments to establish and process them. Unfortunately, these datasets, whether built for proprietary or public use, face a high risk of unauthorized exploitation resulting from data leakage, license violations, etc. Even worse, the "black-box" nature of neural models sets a high barrier for outsiders to audit their training datasets, which further abets such unauthorized usage. Watermarking methods have been proposed to prohibit inappropriate usage of image and natural language datasets. However, due to domain specificity, they are not directly applicable to code datasets, leaving the copyright protection of this emerging and important field of code data still exposed to threats. To fill this gap, we propose a method, named CodeMark, to embed user-defined imperceptible watermarks into code datasets to trace their usage in training neural code completion models. CodeMark is based on adaptive semantic-preserving transformations, which preserve the exact functionality of the code data and keep the changes covert against rule-breakers. We implement CodeMark in a toolkit and conduct an extensive evaluation on code completion models. CodeMark is validated to fulfill all desired properties of a practical watermark: harmlessness to model accuracy, verifiability, robustness, and imperceptibility.
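The abstract describes embedding watermarks through semantic-preserving code transformations. As a rough illustration of the general idea only (not the paper's actual transformation rules; the trigger/target patterns and function names below are hypothetical), a "trigger" code feature can be paired with a functionally equivalent rewrite of a "target" feature, so that a model trained on the watermarked dataset learns to prefer the rewritten form whenever the trigger context appears:

```python
import re

# Hypothetical trigger feature: a `for ... in range(` loop header.
TRIGGER = re.compile(r"\bfor\s+\w+\s+in\s+range\(")
# Hypothetical target feature: `v = v + 1`, rewritable to the
# functionally identical `v += 1` (a semantic-preserving transform).
TARGET = re.compile(r"\b(\w+)\s*=\s*\1\s*\+\s*1\b")

def watermark(snippet: str) -> str:
    """If the trigger appears in a snippet, rewrite every `v = v + 1`
    into `v += 1`; otherwise leave the snippet untouched."""
    if TRIGGER.search(snippet):
        return TARGET.sub(lambda m: f"{m.group(1)} += 1", snippet)
    return snippet

def looks_watermarked(completion: str) -> bool:
    """Verification sketch: a model trained on the watermarked data
    should disproportionately emit the transformed form after the
    trigger context."""
    return "+= 1" in completion

code = "for i in range(n):\n    total = total + 1"
print(watermark(code))  # the loop body becomes `total += 1`
```

Because both forms are valid, functionally identical code, such a transformation changes neither the dataset's utility nor a trained model's accuracy, yet the statistical trigger-target association is measurable at verification time.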
Pages: 1561-1572
Number of pages: 12
Related Papers
50 records in total
  • [1] Code Completion with Statistical Language Models
    Raychev, Veselin
    Vechev, Martin
    Yahav, Eran
    ACM SIGPLAN NOTICES, 2014, 49 (06) : 419 - 428
  • [2] Don't Complete It! Preventing Unhelpful Code Completion for Productive and Sustainable Neural Code Completion Systems
    Sun, Zhensu
    Du, Xiaoning
    Song, Fu
    Wang, Shangwen
    Ni, Mingze
    Li, Li
    2023 IEEE/ACM 45TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING: COMPANION PROCEEDINGS, ICSE-COMPANION, 2023, : 324 - 325
  • [3] Don't Complete It! Preventing Unhelpful Code Completion for Productive and Sustainable Neural Code Completion Systems
    Sun, Zhensu
    Du, Xiaoning
    Song, Fu
    Wang, Shangwen
    Ni, Mingze
    Li, Li
    Lo, David
    ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY, 2025, 34 (01)
  • [4] Code Completion with Neural Attention and Pointer Networks
    Li, Jian
    Wang, Yue
    Lyu, Michael R.
    King, Irwin
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 4159 - 4165
  • [5] Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study
    van Dam, Tim
    Izadi, Maliheh
    van Deursen, Arie
    2023 IEEE/ACM 20TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES, MSR, 2023, : 170 - 182
  • [6] A methodology for refined evaluation of neural code completion approaches
    Le, Kim Tuyen
    Rashidi, Gabriel
    Andrzejak, Artur
    DATA MINING AND KNOWLEDGE DISCOVERY, 2023, 37 (01) : 167 - 204
  • [7] Specializing Neural Networks for Cryptographic Code Completion Applications
    Xiao, Ya
    Song, Wenjia
    Qi, Jingyuan
    Viswanath, Bimal
    McDaniel, Patrick
    Yao, Danfeng
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2023, 49 (06) : 3524 - 3535
  • [8] Fast and Memory-Efficient Neural Code Completion
    Svyatkovskiy, Alexey
    Lee, Sebastian
    Hadjitofi, Anna
    Riechert, Maik
    Franco, Juliana Vicente
    Allamanis, Miltiadis
    2021 IEEE/ACM 18TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES (MSR 2021), 2021, : 329 - 340
  • [9] An Empirical Study on the Usage of BERT Models for Code Completion
    Ciniselli, Matteo
    Cooper, Nathan
    Pascarella, Luca
    Poshyvanyk, Denys
    Di Penta, Massimiliano
    Bavota, Gabriele
    2021 IEEE/ACM 18TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES (MSR 2021), 2021, : 108 - 119