VarGAN: Adversarial Learning of Variable Semantic Representations

Cited by: 0

Authors:

Lin, Yalan [1]

Wan, Chengcheng [2]

Bai, Shuwen [3]

Gu, Xiaodong [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, Sch Software, Shanghai 200240, Peoples R China

[2] East China Normal Univ, Software Engn Inst, Shanghai 200062, Peoples R China

[3] East China Univ Sci & Technol, Dept Comp Sci, Shanghai 200237, Peoples R China

Source:

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING | 2024 / Vol. 50 / No. 06

Funding:

National Natural Science Foundation of China;

Keywords:

Codes; Vectors; Generators; Training; Semantics; Task analysis; Generative adversarial networks; Pre-trained language models; variable name representation; identifier representation; generative adversarial networks; CLONE DETECTION;

DOI:

10.1109/TSE.2024.3391730

Chinese Library Classification (CLC):

TP31 [Computer Software];

Discipline classification codes:

081202 ; 0835 ;

Abstract:

Variable names are of critical importance in code representation learning. However, due to diverse naming conventions, variables often receive arbitrary names, leading to long-tail, out-of-vocabulary (OOV), and other well-known problems. While the Byte-Pair Encoding (BPE) tokenizer addresses the surface-level recognition of low-frequency tokens, it does not address the inadequate training of low-frequency identifiers by code representation models, which results in an imbalanced distribution of rare and common identifiers. Consequently, code representation models struggle to effectively capture the semantics of low-frequency variable names. In this paper, we propose VarGAN, a novel method for variable name representations. VarGAN strengthens the training of low-frequency variables through adversarial training. Specifically, we regard the code representation model as a generator responsible for producing vectors from source code. Additionally, we employ a discriminator that detects whether the code input to the generator contains low-frequency variables. This adversarial setup regularizes the distribution of rare variables, making them overlap with their corresponding high-frequency counterparts in the vector space. Experimental results demonstrate that VarGAN enables CodeBERT to generate code vectors that are more uniformly distributed for both low- and high-frequency identifiers. On the IdBench benchmark, VarGAN improves similarity and relatedness scores by 8% over VarCLR. VarGAN is also validated in downstream tasks, where it exhibits enhanced capabilities in capturing token- and code-level semantics.
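The adversarial setup in the abstract can be illustrated with a small sketch. This record does not give the paper's loss functions, so the code below assumes a standard binary cross-entropy objective in which the generator is trained against flipped labels. The frequency threshold, the function names, and the discriminator outputs are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def label_rare(identifiers, threshold):
    """Mark each distinct identifier as rare if it occurs fewer than `threshold` times."""
    counts = Counter(identifiers)
    return {name: count < threshold for name, count in counts.items()}


def bce(prob, target):
    """Binary cross-entropy for one prediction; prob is clamped to avoid log(0)."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -math.log(prob) if target else -math.log(1.0 - prob)


def discriminator_loss(rare_probs, is_rare):
    """Low when the discriminator predicts the true rare/common flag."""
    return sum(bce(p, y) for p, y in zip(rare_probs, is_rare)) / len(is_rare)


def generator_adversarial_loss(rare_probs, is_rare):
    """Low when the discriminator is fooled (assumed flipped-label objective)."""
    return sum(bce(p, not y) for p, y in zip(rare_probs, is_rare)) / len(is_rare)


# Toy corpus of variable names; "tmpBufX" occurs once and is treated as rare.
corpus = ["i", "i", "i", "count", "count", "tmpBufX"]
flags = label_rare(corpus, threshold=2)  # hypothetical threshold

names = sorted(flags)                    # ["count", "i", "tmpBufX"]
is_rare = [flags[n] for n in names]
rare_probs = [0.1, 0.1, 0.9]             # hypothetical discriminator outputs per name

d_loss = discriminator_loss(rare_probs, is_rare)
g_loss = generator_adversarial_loss(rare_probs, is_rare)
print(f"discriminator loss: {d_loss:.4f}, generator adversarial loss: {g_loss:.4f}")
```

In this toy case the discriminator separates rare from common names easily, so its loss is small and the generator's adversarial loss is large. Under the assumed objective, lowering the generator's loss pushes the vectors of rare names toward those of common names, which is the overlap effect the abstract describes.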


Pages: 1505-1517

Page count: 13
