XCODE: Towards Cross-Language Code Representation with Large-Scale Pre-Training

Times Cited: 5
|
Authors
Lin, Zehao [1 ]
Li, Guodun [1 ]
Zhang, Jingfeng [1 ]
Deng, Yue [1 ]
Zeng, Xiangji [1 ]
Zhang, Yin [1 ]
Wan, Yao [2 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci & Technol, 38 Zheda Rd, Hangzhou 310027, Zhejiang, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Comp Sci & Tech, Wuhan 430027, Hubei, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Deep learning; neural networks; code representation; cross-language; pre-training; LEARN;
DOI
10.1145/3506696
CLC Number
TP31 [Computer Software];
Discipline Codes
081202; 0835;
Abstract
Source code representation learning is the basis of applying artificial intelligence to many software engineering tasks such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve source code representations from various perspectives, e.g., by introducing the structural information of programs into latent representations. However, when dealing with rapidly expanding unlabeled cross-language source code datasets from the Internet, two issues remain. First, deep learning models for many code-specific tasks still suffer from the lack of high-quality labels. Second, the structural differences among programming languages make it difficult to process multiple languages in a single neural architecture. To address these issues, in this article we propose XCoDE, a novel method for Cross-language Code representation with large-scale pre-training. Concretely, we use abstract syntax trees and ELMo-enhanced variational autoencoders to obtain multiple pre-trained source code language models trained on about 1.5 million code snippets. To fully exploit knowledge across programming languages, we further propose a Shared Encoder-Decoder (SED) architecture that uses a multi-teacher single-student method to transfer knowledge from the aforementioned pre-trained models to the distilled SED. The pre-trained models and the SED cooperate to better represent the source code. For evaluation, we examine our approach on three typical downstream cross-language tasks, i.e., source code translation, code clone detection, and code-to-code search, on a real-world dataset composed of programming exercises with multiple solutions. Experimental results demonstrate the effectiveness of our approach on cross-language code representation, and it significantly outperforms several code representation baselines on different downstream tasks in terms of multiple automatic evaluation metrics.
Pages: 44
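
The abstract above describes a multi-teacher single-student distillation scheme in which per-language pre-trained models (the teachers) transfer knowledge to a Shared Encoder-Decoder (the student). The PyTorch sketch below illustrates only that general idea; the toy encoder, module names, and the MSE representation-matching loss are illustrative assumptions, not the authors' implementation (which additionally relies on abstract syntax trees and ELMo-enhanced variational autoencoders).

import torch
import torch.nn as nn
import torch.nn.functional as F


class CodeEncoder(nn.Module):
    """Toy sequence encoder standing in for one per-language pre-trained model."""

    def __init__(self, vocab_size=8000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> pooled (batch, hidden) code representation
        out, _ = self.rnn(self.embed(token_ids))
        return out.mean(dim=1)


def distill_step(student, teachers, batch_by_lang, optimizer):
    # One multi-teacher single-student step: the shared student is trained to
    # match each frozen teacher's representation on code from that teacher's
    # language (representation matching via MSE; the paper's objective may differ).
    optimizer.zero_grad()
    loss = torch.zeros(())
    for lang, tokens in batch_by_lang.items():
        with torch.no_grad():
            teacher_repr = teachers[lang](tokens)   # frozen per-language teacher
        student_repr = student(tokens)              # shared (distilled) student
        loss = loss + F.mse_loss(student_repr, teacher_repr)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    langs = ["java", "python", "cpp"]
    teachers = {lang: CodeEncoder().eval() for lang in langs}  # stand-ins for pre-trained models
    student = CodeEncoder()                                    # shared encoder distilled from all teachers
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    fake_batch = {lang: torch.randint(0, 8000, (4, 32)) for lang in langs}
    print("distillation loss:", distill_step(student, teachers, fake_batch, opt))

In this sketch only the student's parameters are optimized, while each teacher is queried under torch.no_grad(), mirroring the idea of distilling several fixed pre-trained models into a single shared network.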