What Do They Capture? - A Structural Analysis of Pre-Trained Language Models for Source Code

Cited by: 23
Authors
Wan, Yao [1 ,4 ]
Zhao, Wei [1 ,4 ]
Zhang, Hongyu [2 ]
Sui, Yulei [3 ]
Xu, Guandong [3 ]
Jin, Hai [1 ,4 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan, Peoples R China
[2] Univ Newcastle, Newcastle, NSW, Australia
[3] Univ Technol Sydney, Sch Comp Sci, Sydney, NSW, Australia
[4] HUST, Natl Engn Res Ctr Big Data Technol & Syst, Serv Comp Technol & Syst Lab, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Code representation; deep learning; pre-trained language model; probing; attention analysis; syntax tree induction;
DOI
10.1145/3510003.3510050
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline Classification Code
081202; 0835;
Abstract
Recently, many pre-trained language models for source code have been proposed to model the context of code and to serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and the Transformer architecture and have achieved promising results. However, little progress has been made so far on the interpretability of existing pre-trained code models: it is not clear why these models work and what feature correlations they capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis, (2) probing on the word embedding, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-trained language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) The pre-trained models of code are able to induce syntax trees of code. These findings suggest that it may be helpful to incorporate the syntax structure of code into the process of pre-training for better code representations.
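As an illustration of the kind of attention analysis the abstract describes, the sketch below (not taken from the paper; a minimal example assuming the Hugging Face transformers library and the microsoft/codebert-base checkpoint) extracts per-layer, per-head attention weights from CodeBERT for a small code snippet. Such a matrix could then be compared against the edges of the code's syntax tree.

```python
# Minimal sketch (not from the paper): extract per-head attention weights
# from CodeBERT via Hugging Face transformers, as a starting point for
# comparing attention against the code's syntax structure.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/codebert-base"  # assumed checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per Transformer layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attentions = outputs.attentions
print(f"layers: {len(attentions)}, heads per layer: {attentions[0].shape[1]}")

# Example: attention in the last layer, averaged over heads. One could
# threshold this matrix and count how many high-attention token pairs
# correspond to parent-child edges in the snippet's AST.
last_layer_mean = attentions[-1][0].mean(dim=0)  # (seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)
print(last_layer_mean.shape)
```

This only shows how the raw attention maps can be obtained; any alignment metric between attention and syntax-tree edges is left to the analysis design.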
Pages: 2377-2388
Page count: 12