XCODE: Towards Cross-Language Code Representation with Large-Scale Pre-Training

Times Cited: 5
|
Authors
Lin, Zehao [1 ]
Li, Guodun [1 ]
Zhang, Jingfeng [1 ]
Deng, Yue [1 ]
Zeng, Xiangji [1 ]
Zhang, Yin [1 ]
Wan, Yao [2 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, 38 Zheda Rd, Hangzhou 310027, Zhejiang, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Comp Sci & Tech, Wuhan 430027, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; neural networks; code representation; cross-language; pre-training; LEARN;
DOI
10.1145/3506696
CLC Number
TP31 [Computer Software];
Discipline Codes
081202; 0835;
Abstract
Source code representation learning is the basis of applying artificial intelligence to many software engineering tasks such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve source code representations from various perspectives, e.g., by introducing the structural information of programs into latent representations. However, when dealing with rapidly expanding unlabeled cross-language source code datasets from the Internet, two issues remain. First, deep learning models for many code-specific tasks still suffer from the lack of high-quality labels. Second, the structural differences among programming languages make it difficult to process multiple languages in a single neural architecture. To address these issues, in this article we propose XCoDE, a novel method for Cross-language Code representation with large-scale pre-training. Concretely, we use abstract syntax trees and ELMo-enhanced variational autoencoders to obtain multiple pre-trained source code language models trained on about 1.5 million code snippets. To fully exploit knowledge across programming languages, we further propose a Shared Encoder-Decoder (SED) architecture that uses a multi-teacher single-student method to transfer knowledge from the aforementioned pre-trained models to the distilled SED. The pre-trained models and the SED cooperate to better represent the source code. For evaluation, we examine our approach on three typical downstream cross-language tasks, i.e., source code translation, code clone detection, and code-to-code search, on a real-world dataset composed of programming exercises with multiple solutions. Experimental results demonstrate the effectiveness of our approach on cross-language code representation, and it significantly outperforms several code representation baselines on different downstream tasks in terms of multiple automatic evaluation metrics.
Pages: 44
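
The abstract above describes a multi-teacher single-student distillation scheme in which per-language pre-trained models (the teachers) transfer knowledge to a Shared Encoder-Decoder (the student). The PyTorch sketch below illustrates only that general idea; the toy encoder, module names, and the MSE representation-matching loss are illustrative assumptions, not the authors' implementation (which additionally relies on abstract syntax trees and ELMo-enhanced variational autoencoders).

import torch
import torch.nn as nn
import torch.nn.functional as F


class CodeEncoder(nn.Module):
    """Toy sequence encoder standing in for one per-language pre-trained model."""

    def __init__(self, vocab_size=8000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> pooled (batch, hidden) code representation
        out, _ = self.rnn(self.embed(token_ids))
        return out.mean(dim=1)


def distill_step(student, teachers, batch_by_lang, optimizer):
    # One multi-teacher single-student step: the shared student is trained to
    # match each frozen teacher's representation on code from that teacher's
    # language (representation matching via MSE; the paper's objective may differ).
    optimizer.zero_grad()
    loss = torch.zeros(())
    for lang, tokens in batch_by_lang.items():
        with torch.no_grad():
            teacher_repr = teachers[lang](tokens)   # frozen per-language teacher
        student_repr = student(tokens)              # shared (distilled) student
        loss = loss + F.mse_loss(student_repr, teacher_repr)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    langs = ["java", "python", "cpp"]
    teachers = {lang: CodeEncoder().eval() for lang in langs}  # stand-ins for pre-trained models
    student = CodeEncoder()                                    # shared encoder distilled from all teachers
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    fake_batch = {lang: torch.randint(0, 8000, (4, 32)) for lang in langs}
    print("distillation loss:", distill_step(student, teachers, fake_batch, opt))

In this sketch only the student's parameters are optimized, while each teacher is queried under torch.no_grad(), mirroring the idea of distilling several fixed pre-trained models into a single shared network.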