VarGAN: Adversarial Learning of Variable Semantic Representations

Authors
Lin, Yalan [1]
Wan, Chengcheng [2]
Bai, Shuwen [3]
Gu, Xiaodong [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Sch Software, Shanghai 200240, Peoples R China
[2] East China Normal Univ, Software Engn Inst, Shanghai 200062, Peoples R China
[3] East China Univ Sci & Technol, Dept Comp Sci, Shanghai 200237, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Codes; Vectors; Generators; Training; Semantics; Task analysis; Generative adversarial networks; Pre-trained language models; variable name representation; identifier representation; clone detection
DOI
10.1109/TSE.2024.3391730
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Variable names are of critical importance in code representation learning. However, because naming conventions are diverse, variables often receive arbitrary names, which leads to long-tail, out-of-vocabulary (OOV), and other well-known problems. While the Byte-Pair Encoding (BPE) tokenizer addresses the surface-level recognition of low-frequency tokens, it does not address the inadequate training of low-frequency identifiers by code representation models, which results in an imbalanced distribution of rare and common identifiers. Consequently, code representation models struggle to capture the semantics of low-frequency variable names. In this paper, we propose VarGAN, a novel method for variable name representations. VarGAN strengthens the training of low-frequency variables through adversarial training. Specifically, we regard the code representation model as a generator that produces vectors from source code, and we employ a discriminator that detects whether the code input to the generator contains low-frequency variables. This adversarial setup regularizes the distribution of rare variables so that they overlap with their corresponding high-frequency counterparts in the vector space. Experimental results demonstrate that VarGAN enables CodeBERT to generate code vectors that are more uniformly distributed across low- and high-frequency identifiers. On the IdBench benchmark, similarity and relatedness scores improve by 8% over VarCLR. VarGAN is also validated on downstream tasks, where it shows an enhanced ability to capture token-level and code-level semantics.
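The adversarial setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the rarity threshold, the regex-based identifier extraction, and the label-flipping generator loss are all assumptions chosen for illustration, and the probability passed to the loss functions stands in for the output of a discriminator applied to the generator's code vector.

```python
import math
import re
from collections import Counter

# Hypothetical threshold: identifiers seen fewer than this many times are "rare".
RARE_THRESHOLD = 2

# Crude identifier matcher (assumption); it also matches language keywords,
# which are frequent and therefore never counted as rare.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifier_counts(corpus):
    """Count how often each identifier-like token occurs across all snippets."""
    counts = Counter()
    for snippet in corpus:
        counts.update(IDENT.findall(snippet))
    return counts


def contains_rare_identifier(snippet, counts, threshold=RARE_THRESHOLD):
    """Discriminator target: 1 if the snippet uses any rare identifier, else 0."""
    return int(any(counts[tok] < threshold for tok in IDENT.findall(snippet)))


def bce(prob, label):
    """Binary cross-entropy for a single predicted probability."""
    eps = 1e-12
    prob = min(max(prob, eps), 1 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def discriminator_loss(prob_rare, label):
    """The discriminator is trained to predict the rare/common label correctly."""
    return bce(prob_rare, label)


def generator_adversarial_loss(prob_rare, label):
    """The generator is rewarded when the discriminator is wrong.

    Flipping the label is one common formulation (an assumption here); it pushes
    vectors for rare and common identifiers toward the same region.
    """
    return bce(prob_rare, 1 - label)


corpus = ["int count = 0;", "int count = total;", "int xq_7 = count;"]
counts = identifier_counts(corpus)
labels = [contains_rare_identifier(s, counts) for s in corpus]
```

In this toy corpus, `total` and `xq_7` each occur once, so the second and third snippets are labeled as containing low-frequency variables. When the discriminator is confident and correct, its own loss is small while the generator's adversarial loss is large, which is the pressure that drives the two distributions to overlap.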
Pages: 1505-1517
Page count: 13