Contextuality of Code Representation Learning

Cited by: 0
Authors
Li, Yi [1 ]
Wang, Shaohua [1 ]
Nguyen, Tien N. [2 ]
Affiliations
[1] New Jersey Inst Technol, Dept Informat, Newark, NJ 07102 USA
[2] Univ Texas Dallas, Dept Comp Sci, Dallas, TX USA
Keywords
Code Representation Learning; Contextualized Embedding; Contextuality of Code Embedding
DOI
10.1109/ASE56229.2023.00029
CLC Number
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
Advanced machine learning (ML) models have been successfully leveraged in several software engineering (SE) applications. Existing SE techniques have used embedding models ranging from static to contextualized ones to build vectors for program units. Contextualized vectors address a phenomenon in natural-language texts called polysemy, i.e., the coexistence of different meanings of a word/phrase. However, due to their different nature, program units exhibit mixed polysemy: some code tokens and statements are polysemous, while other tokens (e.g., keywords, separators, and operators) and statements maintain the same meaning across contexts. A natural question is whether static or contextualized embeddings fit better with this mixed-polysemy nature of source code. The answer helps SE researchers select the right embedding model. We conducted experiments on 12 popular sequence-, tree-, and graph-based embedding models and on samples of a dataset of 10,222 Java projects with over 14M methods. We present several contextuality evaluation metrics, adapted from natural-language texts to code structures, to evaluate the embeddings from those models. Among several findings, we found that models with higher contextuality help a bug-detection model perform better than static ones do. Neither static nor contextualized embedding models fit well with the mixed-polysemy nature of source code. Thus, we developed HYCODE, a hybrid embedding model that fits better with the nature of mixed polysemy in source code.
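The record does not spell out the paper's contextuality metrics, but a standard measure from the natural-language literature that such adaptations typically build on is self-similarity: the average pairwise cosine similarity between the contextualized vectors a model assigns to the same token across different contexts (a perfectly static embedding scores 1.0). The Python sketch below illustrates that general idea under this assumption; it is not the paper's exact formulation, and the sample vectors and usage are hypothetical.

import numpy as np

def self_similarity(occurrence_vectors):
    # Average pairwise cosine similarity of the contextualized vectors
    # a model assigns to the same token in different contexts.
    # Values near 1.0 mean the embedding is effectively static for this
    # token; lower values mean higher contextuality. Keywords and
    # operators are expected to score near 1.0, identifiers lower.
    vs = np.asarray(occurrence_vectors, dtype=float)
    vs = vs / np.linalg.norm(vs, axis=1, keepdims=True)  # unit-normalize rows
    n = len(vs)
    sims = vs @ vs.T                        # all pairwise cosine similarities
    off_diag = sims.sum() - np.trace(sims)  # exclude self-pairs on the diagonal
    return off_diag / (n * (n - 1))

# Hypothetical usage: two contextualized vectors of the same token, taken
# from two different methods and produced by some embedding model.
v_ctx1 = np.array([0.8, 0.1, 0.3])
v_ctx2 = np.array([0.7, 0.2, 0.4])
print(self_similarity([v_ctx1, v_ctx2]))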
Pages: 548-559
Number of pages: 12
Related Papers
50 records in total
  • [1] Contrastive Code Representation Learning
    Jain, Paras
    Jain, Ajay
    Zhang, Tianjun
    Abbeel, Pieter
    Gonzalez, Joseph E.
    Stoica, Ion
    [J]. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 5954 - 5971
  • [2] Learning a holistic and comprehensive code representation for code summarization
    Yang, Kaiyuan
    Wang, Junfeng
    Song, Zihua
    [J]. JOURNAL OF SYSTEMS AND SOFTWARE, 2023, 203
  • [3] Contextuality and the probability representation of quantum states
    Man'ko, Vladimir I.
    Strakhov, Alexey A.
    [J]. JOURNAL OF RUSSIAN LASER RESEARCH, 2013, 34 (03) : 267 - 277
  • [4] Fault Localization with Code Coverage Representation Learning
    Li, Yi
    Wang, Shaohua
    Nguyen, Tien N.
    [J]. 2021 IEEE/ACM 43RD INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2021), 2021, : 661 - 673
  • [5] MetaTPTrans: A Meta Learning Approach for Multilingual Code Representation Learning
    Pian, Weiguo
    Peng, Hanyu
    Tang, Xunzhu
    Sun, Tiezhu
    Tian, Haoye
    Habib, Andrew
    Klein, Jacques
    Bissyandé, Tegawendé F.
    [J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 4, 2023, : 5239 - 5247
  • [6] Disentangled Code Representation Learning for Multiple Programming Languages
    Zhang, Jingfeng
    Hong, Haiwen
    Zhang, Yin
    Wan, Yao
    Liu, Ye
    Sui, Yulei
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 4454 - 4466
  • [7] Slice-Based Code Change Representation Learning
    Zhang, Fengyi
    Chen, Bihuan
    Zhao, Yufei
    Peng, Xin
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ANALYSIS, EVOLUTION AND REENGINEERING, SANER, 2023, : 319 - 330
  • [8] GTE: learning code AST representation efficiently and effectively
    Qin, Yihao
    Wang, Shangwen
    Lin, Bo
    Yang, Kang
    Mao, Xiaoguang
    [J]. SCIENCE CHINA-INFORMATION SCIENCES, 2025, 68 (03) : 393 - 394
  • [9] Modular Tree Network for Source Code Representation Learning
    Wang, Wenhan
    Li, Ge
    Shen, Sijie
    Xia, Xin
    Jin, Zhi
    [J]. ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY, 2020, 29 (04)