VarGAN: Adversarial Learning of Variable Semantic Representations

Cited by: 0

Authors:

Lin, Yalan [1]

Wan, Chengcheng [2]

Bai, Shuwen [3]

Gu, Xiaodong [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, Sch Software, Shanghai 200240, Peoples R China

[2] East China Normal Univ, Software Engn Inst, Shanghai 200062, Peoples R China

[3] East China Univ Sci & Technol, Dept Comp Sci, Shanghai 200237, Peoples R China

Source:

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING | 2024 / Vol. 50 / Issue 06

Funding:

National Natural Science Foundation of China;

Keywords:

Codes; Vectors; Generators; Training; Semantics; Task analysis; Generative adversarial networks; Pre-trained language models; variable name representation; identifier representation; generative adversarial networks; CLONE DETECTION;

DOI:

10.1109/TSE.2024.3391730

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Variable names are of critical importance in code representation learning. However, due to diverse naming conventions, variables often receive arbitrary names, leading to long-tail, out-of-vocabulary (OOV), and other well-known problems. While the Byte-Pair Encoding (BPE) tokenizer has addressed the surface-level recognition of low-frequency tokens, it has not noticed the inadequate training of low-frequency identifiers by code representation models, resulting in an imbalanced distribution of rare and common identifiers. Consequently, code representation models struggle to effectively capture the semantics of low-frequency variable names. In this paper, we propose VarGAN, a novel method for variable name representations. VarGAN strengthens the training of low-frequency variables through adversarial training. Specifically, we regard the code representation model as a generator responsible for producing vectors from source code. Additionally, we employ a discriminator that detects whether the code input to the generator contains low-frequency variables. This adversarial setup regularizes the distribution of rare variables, making them overlap with their corresponding high-frequency counterparts in the vector space. Experimental results demonstrate that VarGAN empowers CodeBERT to generate code vectors that exhibit more uniform distribution for both low- and high-frequency identifiers. There is an improvement of 8% in similarity and relatedness scores compared to VarCLR in the IdBench benchmark. VarGAN is also validated in downstream tasks, where it exhibits enhanced capabilities in capturing token- and code-level semantics.

Pages: 1505-1517

Number of pages: 13
