VarGAN: Adversarial Learning of Variable Semantic Representations

Cited by: 0
Authors
Lin, Yalan [1]
Wan, Chengcheng [2]
Bai, Shuwen [3]
Gu, Xiaodong [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Software, Shanghai 200240, Peoples R China
[2] East China Normal Univ, Software Engn Inst, Shanghai 200062, Peoples R China
[3] East China Univ Sci & Technol, Dept Comp Sci, Shanghai 200237, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Codes; Vectors; Generators; Training; Semantics; Task analysis; Generative adversarial networks; Pre-trained language models; variable name representation; identifier representation; generative adversarial networks; CLONE DETECTION
DOI
10.1109/TSE.2024.3391730
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Variable names are of critical importance in code representation learning. However, because naming conventions vary widely, variables often receive arbitrary names, which leads to long-tail, out-of-vocabulary (OOV), and other well-known problems. While the Byte-Pair Encoding (BPE) tokenizer addresses the surface-level recognition of low-frequency tokens, it does not address the inadequate training of low-frequency identifiers by code representation models, which results in an imbalanced distribution of rare and common identifiers. Consequently, code representation models struggle to capture the semantics of low-frequency variable names. In this paper, we propose VarGAN, a novel method for variable name representations. VarGAN strengthens the training of low-frequency variables through adversarial training. Specifically, we regard the code representation model as a generator responsible for producing vectors from source code. Additionally, we employ a discriminator that detects whether the code input to the generator contains low-frequency variables. This adversarial setup regularizes the distribution of rare variables, making them overlap with their corresponding high-frequency counterparts in the vector space. Experimental results demonstrate that VarGAN enables CodeBERT to generate code vectors that are more uniformly distributed for both low- and high-frequency identifiers. On the IdBench benchmark, similarity and relatedness scores improve by 8% compared to VarCLR. VarGAN is also validated on downstream tasks, where it exhibits enhanced capabilities in capturing token- and code-level semantics.
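The adversarial setup described in the abstract can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation: the toy corpus, the frequency threshold, the function names, and the plain minimax objective (generator loss equal to the negated discriminator loss) are all assumptions made for illustration, and the abstract does not specify the loss the paper uses.

```python
import math
from collections import Counter


def rare_variable_labels(snippets, threshold):
    """Label a snippet 1 if it uses a variable seen fewer than `threshold` times.

    Assumption: "low-frequency" is decided by a simple corpus-wide count cutoff.
    """
    counts = Counter(name for names in snippets for name in names)
    return [int(any(counts[n] < threshold for n in names)) for names in snippets]


def bce(probs, labels):
    """Mean binary cross-entropy of predicted probabilities against 0/1 labels."""
    eps = 1e-12  # guards against log(0)
    total = sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    )
    return -total / len(labels)


def adversarial_losses(disc_probs, labels):
    """Return (discriminator_loss, generator_loss) for one batch.

    Assumption: a plain minimax game. The discriminator minimizes its
    cross-entropy; the generator is rewarded when the discriminator fails.
    """
    d_loss = bce(disc_probs, labels)
    return d_loss, -d_loss


# Hypothetical toy corpus: each snippet is the list of variable names it uses.
snippets = [["i", "count"], ["i", "tmpBufXq"], ["count", "i"]]
labels = rare_variable_labels(snippets, threshold=2)

# Hypothetical discriminator outputs: P(snippet contains a rare variable).
d_loss, g_loss = adversarial_losses([0.2, 0.9, 0.1], labels)
print(labels, round(d_loss, 4), round(g_loss, 4))
```

In a real system the probabilities would come from a trained discriminator reading vectors produced by the code representation model (the generator), and both losses would be minimized by gradient descent in alternating steps.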
Pages: 1505-1517
Number of pages: 13