Contextuality of Code Representation Learning

Cited by: 0
Authors
Li, Yi [1 ]
Wang, Shaohua [1 ]
Nguyen, Tien N. [2 ]
Affiliations
[1] New Jersey Inst Technol, Dept Informat, Newark, NJ 07102 USA
[2] Univ Texas Dallas, Dept Comp Sci, Dallas, TX USA
Keywords
Code Representation Learning; Contextualized Embedding; Contextuality of Code Embedding
DOI
10.1109/ASE56229.2023.00029
CLC Number
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
Advanced machine learning (ML) models have been successfully leveraged in several software engineering (SE) applications. Existing SE techniques have used embedding models ranging from static to contextualized ones to build vectors for program units. Contextualized vectors address a phenomenon in natural-language texts called polysemy, i.e., the coexistence of different meanings of a word/phrase. However, due to their different nature, program units exhibit mixed polysemy: some code tokens and statements are polysemous, while other tokens (e.g., keywords, separators, and operators) and statements maintain the same meaning across contexts. A natural question is whether static or contextualized embeddings fit better with this mixed-polysemy nature of source code. The answer helps SE researchers select the right embedding model. We conducted experiments on 12 popular sequence-, tree-, and graph-based embedding models and on samples of a dataset of 10,222 Java projects with over 14M methods. We present several contextuality evaluation metrics, adapted from natural-language texts to code structures, to evaluate the embeddings from those models. Among several findings, we found that models with higher contextuality help a bug-detection model perform better than static ones do. Neither static nor contextualized embedding models fit well with the mixed-polysemy nature of source code. Thus, we developed HYCODE, a hybrid embedding model that fits better with the nature of mixed polysemy in source code.
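The record does not spell out the paper's contextuality metrics, but a standard measure from the natural-language literature that such adaptations typically build on is self-similarity: the average pairwise cosine similarity between the contextualized vectors a model assigns to the same token across different contexts (a perfectly static embedding scores 1.0). The Python sketch below illustrates that general idea under this assumption; it is not the paper's exact formulation, and the sample vectors and usage are hypothetical.

import numpy as np

def self_similarity(occurrence_vectors):
    # Average pairwise cosine similarity of the contextualized vectors
    # a model assigns to the same token in different contexts.
    # Values near 1.0 mean the embedding is effectively static for this
    # token; lower values mean higher contextuality. Keywords and
    # operators are expected to score near 1.0, identifiers lower.
    vs = np.asarray(occurrence_vectors, dtype=float)
    vs = vs / np.linalg.norm(vs, axis=1, keepdims=True)  # unit-normalize rows
    n = len(vs)
    sims = vs @ vs.T                        # all pairwise cosine similarities
    off_diag = sims.sum() - np.trace(sims)  # exclude self-pairs on the diagonal
    return off_diag / (n * (n - 1))

# Hypothetical usage: two contextualized vectors of the same token, taken
# from two different methods and produced by some embedding model.
v_ctx1 = np.array([0.8, 0.1, 0.3])
v_ctx2 = np.array([0.7, 0.2, 0.4])
print(self_similarity([v_ctx1, v_ctx2]))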
Pages: 548-559
Number of pages: 12
Related Papers
50 records in total
  • [1] Contrastive Code Representation Learning
    Jain, Paras
    Jain, Ajay
    Zhang, Tianjun
    Abbeel, Pieter
    Gonzalez, Joseph E.
    Stoica, Ion
    [J]. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 5954 - 5971
  • [2] Learning a holistic and comprehensive code representation for code summarization
    Yang, Kaiyuan
    Wang, Junfeng
    Song, Zihua
    [J]. JOURNAL OF SYSTEMS AND SOFTWARE, 2023, 203
  • [3] Contextuality and the probability representation of quantum states
    Man'ko, Vladimir I.
    Strakhov, Alexey A.
    [J]. JOURNAL OF RUSSIAN LASER RESEARCH, 2013, 34 (03) : 267 - 277
  • [4] Fault Localization with Code Coverage Representation Learning
    Li, Yi
    Wang, Shaohua
    Nguyen, Tien N.
    [J]. 2021 IEEE/ACM 43RD INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2021), 2021, : 661 - 673
  • [5] MetaTPTrans: A Meta Learning Approach for Multilingual Code Representation Learning
    Pian, Weiguo
    Peng, Hanyu
    Tang, Xunzhu
    Sun, Tiezhu
    Tian, Haoye
    Habib, Andrew
    Klein, Jacques
    Bissyandé, Tegawendé F.
    [J]. THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 4, 2023, : 5239 - 5247
  • [6] Disentangled Code Representation Learning for Multiple Programming Languages
    Zhang, Jingfeng
    Hong, Haiwen
    Zhang, Yin
    Wan, Yao
    Liu, Ye
    Sui, Yulei
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 4454 - 4466
  • [7] Slice-Based Code Change Representation Learning
    Zhang, Fengyi
    Chen, Bihuan
    Zhao, Yufei
    Peng, Xin
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ANALYSIS, EVOLUTION AND REENGINEERING, SANER, 2023, : 319 - 330
  • [8] GTE: learning code AST representation efficiently and effectively
    Qin, Yihao
    Wang, Shangwen
    Lin, Bo
    Yang, Kang
    Mao, Xiaoguang
    [J]. SCIENCE CHINA-INFORMATION SCIENCES, 2025, 68 (03) : 393 - 394
  • [9] Modular Tree Network for Source Code Representation Learning
    Wang, Wenhan
    Li, Ge
    Shen, Sijie
    Xia, Xin
    Jin, Zhi
    [J]. ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY, 2020, 29 (04)