MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation

Cited by: 6
Authors
Dong, Zeming [1]
Hu, Qiang [2]
Guo, Yuejun [3]
Cordy, Maxime [2]
Papadakis, Mike [2]
Zhang, Zhenya [1]
Le Traon, Yves [2]
Zhao, Jianjun [1]
Affiliations
[1] Kyushu Univ, Fukuoka, Japan
[2] Univ Luxembourg, Luxembourg, Luxembourg
[3] Luxembourg Inst Sci & Technol, Luxembourg, Luxembourg
Keywords
Data augmentation; Mixup; Source code analysis
DOI
10.1109/SANER56733.2023.00043
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Inspired by the great success of Deep Neural Networks (DNNs) in natural language processing (NLP), DNNs have been increasingly applied to source code analysis and have attracted significant attention from the software engineering community. Due to its data-driven nature, a DNN model requires massive, high-quality labeled training data to achieve expert-level performance. Collecting such data is often not hard, but the labeling process is notoriously laborious. DNN-based code analysis makes the situation even worse, because labeling source code also demands sophisticated expertise. Data augmentation has been a popular approach to supplementing training data in domains such as computer vision and NLP. However, existing data augmentation approaches in code analysis adopt simple methods, such as data transformation and adversarial example generation, and thus bring limited performance gains. In this paper, we propose MIXCODE, a data augmentation approach that aims to effectively supplement valid training data, inspired by Mixup, a recent advance in computer vision. Specifically, we first use multiple code refactoring methods to generate transformed code that holds labels consistent with the original data. Then, we adapt the Mixup technique to mix the original code with the transformed code to augment the training data. We evaluate MIXCODE on two programming languages (Java and Python), two code tasks (problem classification and bug detection), four benchmark datasets (JAVA250, Python800, CodRepl, and Refactory), and seven model architectures (including two pre-trained models, CodeBERT and GraphCodeBERT). Experimental results demonstrate that MIXCODE outperforms the baseline data augmentation approach by up to 6.24% in accuracy and 26.06% in robustness.
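The mixing step described in the abstract can be illustrated with a minimal sketch of standard Mixup, which linearly interpolates two inputs and their labels with a coefficient λ drawn from a Beta distribution. This is not the authors' implementation: it assumes that code snippets have already been embedded as fixed-length feature vectors and that labels are one-hot (the abstract does not specify the representation), and the names `mixup`, `sample_lambda`, and `alpha` are hypothetical.

```python
import random


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing coefficient from Beta(alpha, alpha), as in standard Mixup.

    `alpha` is a hypothetical hyperparameter value chosen for illustration.
    """
    return rng.betavariate(alpha, alpha)


def mixup(x_orig, x_trans, y_orig, y_trans, lam):
    """Linearly interpolate two feature vectors and their one-hot labels.

    x_orig  : feature vector of an original code snippet (assumed embedding)
    x_trans : feature vector of a refactored (transformed) code snippet
    y_orig, y_trans : one-hot label vectors of the two snippets
    lam     : mixing coefficient in [0, 1]
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_orig, x_trans)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_orig, y_trans)]
    return x_mix, y_mix


# Toy usage: mix an "original" and a "transformed" vector with lam = 0.5.
x_mix, y_mix = mixup([1.0, 0.0, 2.0], [0.0, 4.0, 2.0], [1.0, 0.0], [0.0, 1.0], 0.5)
print(x_mix, y_mix)
```

In training, `lam` would typically be resampled with `sample_lambda` for each pair or batch, and the model would be trained on the mixed pair `(x_mix, y_mix)` alongside or instead of the original samples.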
Pages: 379-390
Number of pages: 12
Related papers
50 in total
  • [1] MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation
    Dong, Zeming
    Hu, Qiang
    Guo, Yuejun
    Cordy, Maxime
    Papadakis, Mike
    Zhang, Zhenya
    Le Traon, Yves
    Zhao, Jianjun
    [J]. arXiv, 2022,
  • [2] MixCode: Enhancing Code Classification by Mixup-Based Data Augmentation
    Dong, Zeming
    Hu, Qiang
    Guo, Yuejun
    Cordy, Maxime
    Papadakis, Mike
    Zhang, Zhenya
    Le Traon, Yves
    Zhao, Jianjun
    [J]. Proceedings - 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2023, 2023, : 379 - 390
  • [3] Mixup-Based Data Augmentation for Histopathologic Cancer Detection
    Xu, Kele
    [J]. MEDICAL PHYSICS, 2019, 46 (06) : E336 - E337
  • [4] DP-Mix: Mixup-based Data Augmentation for Differentially Private Learning
    Bao, Wenxuan
    Pittaluga, Francesco
    Kumar, Vijay B. G.
    Bindschaedler, Vincent
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [5] Enhancing Mixup-based Semi-Supervised Learning with Explicit Lipschitz Regularization
    Gyawali, Prashnna Kumar
    Ghimire, Sandesh
    Wang, Linwei
    [J]. 20TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2020), 2020, : 1046 - 1051
  • [6] Probabilistic Interpolation with Mixup Data Augmentation for Text Classification
    Xu, Rongkang
    Zhang, Yongcheng
    Ren, Kai
    Huang, Yu
    Wei, Xiaomei
    [J]. ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT IV, ICIC 2024, 2024, 14878 : 410 - 421
  • [7] Mixup-based classification of mixed-type defect patterns in wafer bin maps
    Shin, Wooksoo
    Kahng, Hyungu
    Kim, Seoung Bum
    [J]. COMPUTERS & INDUSTRIAL ENGINEERING, 2022, 167
  • [8] G-Mixup: Graph Data Augmentation for Graph Classification
    Han, Xiaotian
    Jiang, Zhimeng
    Liu, Ninghao
    Hu, Xia
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [9] Mixup-Based Acoustic Scene Classification Using Multi-channel Convolutional Neural Network
    Xu, Kele
    Feng, Dawei
    Mi, Haibo
    Zhu, Boqing
    Wang, Dezhi
    Zhang, Lilun
    Cai, Hengxing
    Liu, Shuwen
    [J]. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT III, 2018, 11166 : 14 - 23
  • [10] FreMix: Frequency-Based Mixup for Data Augmentation
    Xiu, Yang
    Zheng, Xinyi
    Sun, Linlin
    Fang, Zhuohao
    [J]. WIRELESS COMMUNICATIONS & MOBILE COMPUTING, 2022, 2022