What Do They Capture? - A Structural Analysis of Pre-Trained Language Models for Source Code

Cited by: 23

Authors:

Wan, Yao [1,4]

Zhao, Wei [1,4]

Zhang, Hongyu [2]

Sui, Yulei [3]

Xu, Guandong [3]

Jin, Hai [1,4]

Affiliations:

[1] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan, Peoples R China

[2] Univ Newcastle, Newcastle, NSW, Australia

[3] Univ Technol Sydney, Sch Comp Sci, Sydney, NSW, Australia

[4] HUST, Natl Engn Res Ctr Big Data Technol & Syst, Serv Comp Technol & Syst Lab, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China

Source:

2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022) | 2022

Funding:

National Natural Science Foundation of China;

Keywords:

Code representation; deep learning; pre-trained language model; probing; attention analysis; syntax tree induction;

DOI:

10.1145/3510003.3510050

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and Transformer and have achieved promising results. However, currently there is still little progress regarding interpretability of existing pre-trained code models. It is not clear why these models work and what feature correlations they can capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis, (2) probing on the word embedding, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-training language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) The pre-trained models of code have the ability of inducing syntax trees of code. These findings suggest that it may be helpful to incorporate the syntax structure of code into the process of pre-training for better code representations.
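
As a rough illustration of the attention analysis mentioned in the abstract, the sketch below scores how often high-attention token pairs coincide with syntactically related token pairs. It is not the authors' implementation: the function name `attention_syntax_alignment`, the `threshold` value, the toy attention matrix, and the notion of "syntactically related" (here, a hand-written set of token index pairs) are all assumptions made for this example.

```python
from typing import List, Set, Tuple


def attention_syntax_alignment(
    attention: List[List[float]],
    syntax_pairs: Set[Tuple[int, int]],
    threshold: float = 0.3,
) -> float:
    """Return the fraction of high-attention token pairs that are
    also marked as syntactically related (hypothetical metric)."""
    high = 0      # pairs whose attention weight exceeds the threshold
    aligned = 0   # high-attention pairs that are also in syntax_pairs
    for i, row in enumerate(attention):
        for j, weight in enumerate(row):
            # Skip self-attention and low-weight pairs.
            if i == j or weight <= threshold:
                continue
            high += 1
            # Treat the syntactic relation as symmetric.
            if (i, j) in syntax_pairs or (j, i) in syntax_pairs:
                aligned += 1
    return aligned / high if high else 0.0


# Toy 4-token attention matrix for one head (made-up values).
attention = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.1, 0.2, 0.1, 0.6],
    [0.4, 0.1, 0.4, 0.1],
]
# Assumed syntactic relations, e.g. token pairs sharing a syntax-tree parent.
syntax_pairs = {(0, 1), (2, 3)}

score = attention_syntax_alignment(attention, syntax_pairs)
print(f"alignment score: {score:.2f}")
```

In practice the attention matrix would come from a model such as CodeBERT and the related pairs from a parser; both are replaced by toy data here.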


Pages: 2377-2388

Page count: 12

