XCODE: Towards Cross-Language Code Representation with Large-Scale Pre-Training

Cited by: 5
Authors
Lin, Zehao [1 ]
Li, Guodun [1 ]
Zhang, Jingfeng [1 ]
Deng, Yue [1 ]
Zeng, Xiangji [1 ]
Zhang, Yin [1 ]
Wan, Yao [2 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, 38 Zheda Rd, Hangzhou 310027, Zhejiang, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Comp Sci & Tech, Wuhan 430027, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; neural networks; code representation; cross-language; pre-training; LEARN;
DOI
10.1145/3506696
Chinese Library Classification (CLC)
TP31 [Computer Software];
Subject Classification Codes
081202; 0835;
Abstract
Source code representation learning is the basis of applying artificial intelligence to many software engineering tasks such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve the performance of source code representation from various perspectives, e.g., by introducing the structural information of programs into the latent representation. However, when dealing with the rapidly expanding unlabeled cross-language source code datasets from the Internet, two issues remain. First, deep learning models for many code-specific tasks still suffer from a lack of high-quality labels. Second, the structural differences among programming languages make it more difficult to process multiple languages in a single neural architecture. To address these issues, in this article, we propose XCode, a novel method for Cross-language Code representation with large-scale pre-training. Concretely, we propose to use several abstract syntax trees and ELMo-enhanced variational autoencoders to obtain multiple pre-trained source code language models trained on about 1.5 million code snippets. To fully utilize the knowledge across programming languages, we further propose a Shared Encoder-Decoder (SED) architecture that uses a multi-teacher, single-student method to transfer knowledge from the aforementioned pre-trained models to the distilled SED. The pre-trained models and the SED cooperate to better represent the source code. For evaluation, we examine our approach on three typical downstream cross-language tasks, i.e., source code translation, code clone detection, and code-to-code search, on a real-world dataset composed of programming exercises with multiple solutions. Experimental results demonstrate the effectiveness of our proposed approach on cross-language code representation. Meanwhile, our approach performs significantly better than several code representation baselines on different downstream tasks in terms of multiple automatic evaluation metrics.
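To make the multi-teacher, single-student transfer described in the abstract concrete, the following is a minimal, hypothetical sketch, not the authors' implementation: each frozen per-language teacher (standing in for the AST-based, ELMo-enhanced VAE language models) produces a target representation, and the student's shared encoder is pulled toward all of them. The class and function names (CodeEncoder, SharedEncoderDecoder, distillation_loss), the MSE distillation objective, and the shared token vocabulary are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch (not the paper's code): distill several per-language
# teacher encoders into one shared encoder-decoder (SED) student.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB, HID = 128, 256

class CodeEncoder(nn.Module):
    """Per-language 'teacher' encoder; stands in for an AST/ELMo-enhanced VAE model."""
    def __init__(self, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        _, h = self.rnn(self.embed(tokens))
        return h.squeeze(0)                      # (batch, HID) code representation

class SharedEncoderDecoder(nn.Module):
    """The single 'student': one encoder shared across languages plus a decoder head."""
    def __init__(self, vocab_size: int):
        super().__init__()
        self.encoder = CodeEncoder(vocab_size)
        self.decoder = nn.GRU(HID, HID, batch_first=True)   # reconstruction head, unused in this sketch

def distillation_loss(student: SharedEncoderDecoder,
                      teachers: dict,
                      batches: dict) -> torch.Tensor:
    """Pull the student's representation toward each frozen teacher's output (MSE assumed)."""
    loss = torch.tensor(0.0)
    for lang, tokens in batches.items():
        with torch.no_grad():                    # teachers are frozen pre-trained models
            target = teachers[lang](tokens)
        loss = loss + F.mse_loss(student.encoder(tokens), target)
    return loss / len(batches)

if __name__ == "__main__":
    vocab = 1000
    teachers = {lang: CodeEncoder(vocab) for lang in ("java", "python", "cpp")}
    student = SharedEncoderDecoder(vocab)
    batches = {lang: torch.randint(0, vocab, (4, 32)) for lang in teachers}
    print(distillation_loss(student, teachers, batches).item())
```

In the paper, the pre-trained models and the distilled SED additionally cooperate to produce the final representation; that interaction, as well as the VAE pre-training itself, is omitted here for brevity.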
Pages: 44