VarGAN: Adversarial Learning of Variable Semantic Representations

Cited by: 0

Authors:

Lin, Yalan [1]

Wan, Chengcheng [2]

Bai, Shuwen [3]

Gu, Xiaodong [1]

Affiliations:

[1] Shanghai Jiao Tong University, School of Software, Shanghai 200240, People's Republic of China

[2] East China Normal University, Software Engineering Institute, Shanghai 200062, People's Republic of China

[3] East China University of Science and Technology, Department of Computer Science, Shanghai 200237, People's Republic of China

Source:

IEEE Transactions on Software Engineering | 2024, Vol. 50, No. 6

Funding:

National Natural Science Foundation of China;

Keywords:

Codes; Vectors; Generators; Training; Semantics; Task analysis; Generative adversarial networks; Pre-trained language models; variable name representation; identifier representation; clone detection;

DOI:

10.1109/TSE.2024.3391730

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202; 0835;

Abstract:

Variable names are of critical importance in code representation learning. However, because naming conventions are diverse, variables often receive arbitrary names, which leads to long-tail, out-of-vocabulary (OOV), and other well-known problems. While the Byte-Pair Encoding (BPE) tokenizer addresses the surface-level recognition of low-frequency tokens, it does not address the inadequate training that code representation models give to low-frequency identifiers, which results in an imbalanced distribution of rare and common identifiers. Consequently, code representation models struggle to capture the semantics of low-frequency variable names. In this paper, we propose VarGAN, a novel method for variable name representations. VarGAN strengthens the training of low-frequency variables through adversarial training. Specifically, we regard the code representation model as a generator responsible for producing vectors from source code. Additionally, we employ a discriminator that detects whether the code input to the generator contains low-frequency variables. This adversarial setup regularizes the distribution of rare variables, making them overlap with their corresponding high-frequency counterparts in the vector space. Experimental results demonstrate that VarGAN enables CodeBERT to generate code vectors that are more uniformly distributed across low- and high-frequency identifiers. On the IdBench benchmark, VarGAN improves similarity and relatedness scores by 8% compared to VarCLR. VarGAN is also validated on downstream tasks, where it shows an enhanced ability to capture token- and code-level semantics.
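
The generator/discriminator setup described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a small embedding table in place of CodeBERT, a logistic-regression discriminator, a plain binary cross-entropy objective, and a made-up six-name vocabulary. All names below (`VOCAB`, `emb`, `predict`, `train`, `centroid_gap`) are hypothetical and chosen only for illustration.

```python
import math
import random

random.seed(0)
DIM = 4

# Hypothetical toy vocabulary: True marks a low-frequency (rare) variable name.
VOCAB = {"i": False, "count": False, "idx": False,
         "cnt_tmp": True, "qzx": True, "n_itr": True}

# "Generator": an embedding table standing in for a code representation model.
# Rare names start offset from frequent ones, so they are easy to tell apart.
emb = {tok: [random.gauss(2.0 if rare else 0.0, 0.1) for _ in range(DIM)]
       for tok, rare in VOCAB.items()}

# "Discriminator": logistic regression estimating P(rare | vector).
w = [0.0] * DIM
b = 0.0

def predict(vec):
    """Discriminator output for one vector."""
    z = sum(wi * vi for wi, vi in zip(w, vec)) + b
    return 1.0 / (1.0 + math.exp(-z))

def centroid_gap():
    """Euclidean distance between the mean rare and mean frequent vectors."""
    def mean(flag):
        vecs = [emb[t] for t, rare in VOCAB.items() if rare == flag]
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    return math.dist(mean(True), mean(False))

def train(steps=50, lr_d=0.5, lr_g=0.1):
    """Alternate discriminator and generator updates (assumed BCE objective)."""
    global b
    for _ in range(steps):
        # Discriminator step: gradient descent on the cross-entropy loss.
        for tok, rare in VOCAB.items():
            err = predict(emb[tok]) - float(rare)  # d(loss)/d(logit)
            for k in range(DIM):
                w[k] -= lr_d * err * emb[tok][k]
            b -= lr_d * err
        # Generator step: gradient ascent on the same loss, so each vector
        # moves in the direction that makes the discriminator less correct.
        for tok, rare in VOCAB.items():
            err = predict(emb[tok]) - float(rare)
            for k in range(DIM):
                emb[tok][k] += lr_g * err * w[k]
```

Calling `centroid_gap()` before and after a few `train()` steps shows the rare and frequent vectors being pulled toward each other, which is the overlap effect the abstract describes. The actual loss, architecture, and training schedule used by VarGAN are not specified in this record.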

Pages: 1505-1517

Page count: 13
