XCODE: Towards Cross-Language Code Representation with Large-Scale Pre-Training

Cited by: 5
Authors
Lin, Zehao [1 ]
Li, Guodun [1 ]
Zhang, Jingfeng [1 ]
Deng, Yue [1 ]
Zeng, Xiangji [1 ]
Zhang, Yin [1 ]
Wan, Yao [2 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, 38 Zheda Rd, Hangzhou 310027, Zhejiang, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Comp Sci & Tech, Wuhan 430027, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; neural networks; code representation; cross-language; pre-training; LEARN;
DOI
10.1145/3506696
Chinese Library Classification (CLC)
TP31 [Computer Software];
Subject Classification Codes
081202; 0835;
Abstract
Source code representation learning is the basis of applying artificial intelligence to many software engineering tasks such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve the performance of source code representation from various perspectives, e.g., by introducing the structural information of programs into the latent representation. However, when dealing with the rapidly expanding unlabeled cross-language source code datasets from the Internet, two issues remain. First, deep learning models for many code-specific tasks still suffer from a lack of high-quality labels. Second, the structural differences among programming languages make it more difficult to process multiple languages in a single neural architecture. To address these issues, in this article, we propose XCode, a novel method for Cross-language Code representation with large-scale pre-training. Concretely, we propose to use several abstract syntax trees and ELMo-enhanced variational autoencoders to obtain multiple pre-trained source code language models trained on about 1.5 million code snippets. To fully utilize the knowledge across programming languages, we further propose a Shared Encoder-Decoder (SED) architecture that uses a multi-teacher, single-student method to transfer knowledge from the aforementioned pre-trained models to the distilled SED. The pre-trained models and the SED cooperate to better represent the source code. For evaluation, we examine our approach on three typical downstream cross-language tasks, i.e., source code translation, code clone detection, and code-to-code search, on a real-world dataset composed of programming exercises with multiple solutions. Experimental results demonstrate the effectiveness of our proposed approach on cross-language code representation. Meanwhile, our approach performs significantly better than several code representation baselines on different downstream tasks in terms of multiple automatic evaluation metrics.
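To make the multi-teacher, single-student transfer described in the abstract concrete, the following is a minimal, hypothetical sketch, not the authors' implementation: each frozen per-language teacher (standing in for the AST-based, ELMo-enhanced VAE language models) produces a target representation, and the student's shared encoder is pulled toward all of them. The class and function names (CodeEncoder, SharedEncoderDecoder, distillation_loss), the MSE distillation objective, and the shared token vocabulary are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch (not the paper's code): distill several per-language
# teacher encoders into one shared encoder-decoder (SED) student.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB, HID = 128, 256

class CodeEncoder(nn.Module):
    """Per-language 'teacher' encoder; stands in for an AST/ELMo-enhanced VAE model."""
    def __init__(self, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        _, h = self.rnn(self.embed(tokens))
        return h.squeeze(0)                      # (batch, HID) code representation

class SharedEncoderDecoder(nn.Module):
    """The single 'student': one encoder shared across languages plus a decoder head."""
    def __init__(self, vocab_size: int):
        super().__init__()
        self.encoder = CodeEncoder(vocab_size)
        self.decoder = nn.GRU(HID, HID, batch_first=True)   # reconstruction head, unused in this sketch

def distillation_loss(student: SharedEncoderDecoder,
                      teachers: dict,
                      batches: dict) -> torch.Tensor:
    """Pull the student's representation toward each frozen teacher's output (MSE assumed)."""
    loss = torch.tensor(0.0)
    for lang, tokens in batches.items():
        with torch.no_grad():                    # teachers are frozen pre-trained models
            target = teachers[lang](tokens)
        loss = loss + F.mse_loss(student.encoder(tokens), target)
    return loss / len(batches)

if __name__ == "__main__":
    vocab = 1000
    teachers = {lang: CodeEncoder(vocab) for lang in ("java", "python", "cpp")}
    student = SharedEncoderDecoder(vocab)
    batches = {lang: torch.randint(0, vocab, (4, 32)) for lang in teachers}
    print(distillation_loss(student, teachers, batches).item())
```

In the paper, the pre-trained models and the distilled SED additionally cooperate to produce the final representation; that interaction, as well as the VAE pre-training itself, is omitted here for brevity.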
Pages: 44