DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

被引:0
|
作者
Lachaux, Marie-Anne [1 ]
Roziere, Baptiste [2 ]
Szafraniec, Marc [1 ]
Lample, Guillaume [1 ]
机构
[1] Facebook AI Res, New York, NY 10021 USA
[2] Paris Dauphine Univ, Paris, France
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 12.2% in unsupervised code translation, and 5.3% in natural language code search. Incidentally, we found that our pre-trained model is able to deobfuscate fully obfuscated source files, and to suggest descriptive variable names.
引用
收藏
页数:13
相关论文
共 50 条
  • [1] eMLM: A New Pre-training Objective for Emotion Related Tasks
    Sosea, Tiberiu
    Caragea, Cornelia
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021, : 286 - 293
  • [2] Pre-training via Leveraging Assisting Languages for Neural Machine Translation
    Song, Haiyue
    Dabre, Raj
    Mao, Zhuoyuan
    Cheng, Fei
    Kurohashi, Sadao
    Sumita, Eiichiro
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020): STUDENT RESEARCH WORKSHOP, 2020, : 279 - 285
  • [3] Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages
    Karouimu, Yasmine
    Lebret, Remi
    Foroutan, Negar
    Aberer, Karl
    61ST CONFERENCE OF THE THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023, : 366 - 375
  • [4] Multi-stage Pre-training over Simplified Multimodal Pre-training Models
    Liu, Tongtong
    Feng, Fangxiang
    Wang, Xiaojie
    59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1 (ACL-IJCNLP 2021), 2021, : 2556 - 2565
  • [5] Evaluating Pre-training Objectives for Low-Resource Translation into Morphologically Rich Languages
    Dhar, Prajit
    Bisazza, Arianna
    van Noord, Gertjan
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 4933 - 4943
  • [6] Table Pre-training: A Survey on Model Architectures, Pre-training Objectives, and Downstream Tasks
    Dong, Haoyu
    Cheng, Zhoujun
    He, Xinyi
    Zhou, Mengyu
    Zhou, Anda
    Zhou, Fan
    Liu, Ao
    Han, Shi
    Zhang, Dongmei
    PROCEEDINGS OF THE THIRTY-FIRST INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2022, 2022, : 5426 - 5435
  • [7] Rethinking ImageNet Pre-training
    He, Kaiming
    Girshick, Ross
    Dollar, Piotr
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 4917 - 4926
  • [8] Photo Pre-Training, But for Sketch
    Ke, L.
    Pang, Kaiyue
    Song, Yi-Zhe
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2754 - 2764
  • [9] Pre-Training to Learn in Context
    Gu, Yuxian
    Dong, Li
    Wei, Furu
    Huang, Minlie
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 4849 - 4870
  • [10] Pre-training via Paraphrasing
    Lewis, Mike
    Ghazvininejad, Marjan
    Ghosh, Gargi
    Aghajanyan, Armen
    Wang, Sida
    Zettlemoyer, Luke
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33