What Do They Capture? - A Structural Analysis of Pre-Trained Language Models for Source Code

Cited by: 23
Authors
Wan, Yao [1 ,4 ]
Zhao, Wei [1 ,4 ]
Zhang, Hongyu [2 ]
Sui, Yulei [3 ]
Xu, Guandong [3 ]
Jin, Hai [1 ,4 ]
Affiliations
[1] Huazhong Univ Sci & Technol, Sch Comp Sci & Technol, Wuhan, Peoples R China
[2] Univ Newcastle, Newcastle, NSW, Australia
[3] Univ Technol Sydney, Sch Comp Sci, Sydney, NSW, Australia
[4] HUST, Natl Engn Res Ctr Big Data Technol & Syst, Serv Comp Technol & Syst Lab, Cluster & Grid Comp Lab, Wuhan 430074, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Code representation; deep learning; pre-trained language model; probing; attention analysis; syntax tree induction;
DOI
10.1145/3510003.3510050
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline Classification Code
081202; 0835;
Abstract
Recently, many pre-trained language models for source code have been proposed to model the context of code and to serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and the Transformer architecture and have achieved promising results. However, little progress has been made so far on the interpretability of existing pre-trained code models: it is not clear why these models work and what feature correlations they capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis, (2) probing on the word embedding, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-trained language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) The pre-trained models of code are able to induce syntax trees of code. These findings suggest that it may be helpful to incorporate the syntax structure of code into the process of pre-training for better code representations.
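As an illustration of the kind of attention analysis the abstract describes, the sketch below (not taken from the paper; a minimal example assuming the Hugging Face transformers library and the microsoft/codebert-base checkpoint) extracts per-layer, per-head attention weights from CodeBERT for a small code snippet. Such a matrix could then be compared against the edges of the code's syntax tree.

```python
# Minimal sketch (not from the paper): extract per-head attention weights
# from CodeBERT via Hugging Face transformers, as a starting point for
# comparing attention against the code's syntax structure.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/codebert-base"  # assumed checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per Transformer layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attentions = outputs.attentions
print(f"layers: {len(attentions)}, heads per layer: {attentions[0].shape[1]}")

# Example: attention in the last layer, averaged over heads. One could
# threshold this matrix and count how many high-attention token pairs
# correspond to parent-child edges in the snippet's AST.
last_layer_mean = attentions[-1][0].mean(dim=0)  # (seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)
print(last_layer_mean.shape)
```

This only shows how the raw attention maps can be obtained; any alignment metric between attention and syntax-tree edges is left to the analysis design.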
Pages: 2377-2388
Page count: 12