MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation

Cited by: 6
Authors
Dong, Zeming [1]
Hu, Qiang [2]
Guo, Yuejun [3]
Cordy, Maxime [2]
Papadakis, Mike [2]
Zhang, Zhenya [1]
Le Traon, Yves [2]
Zhao, Jianjun [1]
Affiliations
[1] Kyushu Univ, Fukuoka, Japan
[2] Univ Luxembourg, Luxembourg, Luxembourg
[3] Luxembourg Inst Sci & Technol, Luxembourg, Luxembourg
Keywords
Data augmentation; Mixup; Source code analysis
DOI
10.1109/SANER56733.2023.00043
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Following the great success of Deep Neural Networks (DNNs) in natural language processing (NLP), DNNs have been increasingly applied to source code analysis and have attracted significant attention from the software engineering community. Because DNNs are data-driven, a model requires massive amounts of high-quality labeled training data to achieve expert-level performance. Collecting such data is often not hard, but the labeling process is notoriously laborious. DNN-based code analysis makes the situation worse, because labeling source code also demands sophisticated expertise. Data augmentation is a popular way to supplement training data in domains such as computer vision and NLP. However, existing data augmentation approaches in code analysis rely on simple methods, such as data transformation and adversarial example generation, and therefore bring only limited performance gains. In this paper, we propose MIXCODE, a data augmentation approach that aims to effectively supplement valid training data, inspired by Mixup, a recent advance in computer vision. Specifically, we first use multiple code refactoring methods to generate transformed code whose labels are consistent with those of the original data. Then, we adapt the Mixup technique to mix the original code with the transformed code and thereby augment the training data. We evaluate MIXCODE on two programming languages (Java and Python), two code tasks (problem classification and bug detection), four benchmark datasets (JAVA250, Python800, CodRepl, and Refactory), and seven model architectures (including two pre-trained models, CodeBERT and GraphCodeBERT). Experimental results demonstrate that MIXCODE outperforms the baseline data augmentation approach by up to 6.24% in accuracy and 26.06% in robustness.
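
The mixing step mentioned in the abstract can be illustrated with a minimal sketch of standard Mixup, which forms a convex combination of two inputs and of their labels. The abstract does not say how MIXCODE represents code or chooses the mixing ratio, so everything here is an assumption for illustration: the inputs are taken to be fixed-length embedding vectors, the labels are one-hot vectors, the ratio is drawn from a Beta(alpha, alpha) distribution as is common for Mixup, and the names `mixup` and `sample_lambda` are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Vector = List[float]


def sample_lambda(alpha: float = 0.2, rng: Optional[random.Random] = None) -> float:
    """Draw a mixing ratio in [0, 1] from Beta(alpha, alpha).

    The Beta prior and the default alpha are assumptions borrowed from
    common Mixup practice, not values stated in the abstract.
    """
    source = rng if rng is not None else random
    return source.betavariate(alpha, alpha)


def mixup(
    x_a: Vector, y_a: Vector, x_b: Vector, y_b: Vector, lam: float
) -> Tuple[Vector, Vector]:
    """Linearly interpolate two (input, label) pairs with ratio lam.

    x_a / x_b are assumed to be embeddings of an original program and of
    a transformed (refactored) program; y_a / y_b are one-hot labels.
    """
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("inputs and labels must have matching lengths")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


# Hypothetical usage: mix an original embedding with a transformed one.
original_x, original_y = [1.0, 3.0], [1.0, 0.0]
transformed_x, transformed_y = [3.0, 5.0], [0.0, 1.0]
lam = sample_lambda(alpha=0.2, rng=random.Random(0))
mixed_x, mixed_y = mixup(original_x, original_y, transformed_x, transformed_y, lam)
```

When both samples carry the same label, as with a program and a label-preserving refactoring of it, the interpolated label equals that shared label, so only the input is perturbed.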
Pages: 379-390
Page count: 12