MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation

Cited by: 6
Authors
Dong, Zeming [1]
Hu, Qiang [2]
Guo, Yuejun [3]
Cordy, Maxime [2]
Papadakis, Mike [2]
Zhang, Zhenya [1]
Le Traon, Yves [2]
Zhao, Jianjun [1]
Affiliations
[1] Kyushu Univ, Fukuoka, Japan
[2] Univ Luxembourg, Luxembourg, Luxembourg
[3] Luxembourg Inst Sci & Technol, Luxembourg, Luxembourg
Keywords
Data augmentation; Mixup; Source code analysis
DOI
10.1109/SANER56733.2023.00043
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Inspired by the great success of Deep Neural Networks (DNNs) in natural language processing (NLP), DNNs have been increasingly applied in source code analysis and attracted significant attention from the software engineering community. Due to its data-driven nature, a DNN model requires massive and high-quality labeled training data to achieve expert-level performance. Collecting such data is often not hard, but the labeling process is notoriously laborious. The task of DNN-based code analysis even worsens the situation because source code labeling also demands sophisticated expertise. Data augmentation has been a popular approach to supplement training data in domains such as computer vision and NLP. However, existing data augmentation approaches in code analysis adopt simple methods, such as data transformation and adversarial example generation, thus bringing limited performance superiority. In this paper, we propose a data augmentation approach MIXCODE that aims to effectively supplement valid training data, inspired by the recent advance named Mixup in computer vision. Specifically, we first utilize multiple code refactoring methods to generate transformed code that holds consistent labels with the original data. Then, we adapt the Mixup technique to mix the original code with the transformed code to augment the training data. We evaluate MIXCODE on two programming languages (Java and Python), two code tasks (problem classification and bug detection), four benchmark datasets (JAVA250, Python800, CodRepl, and Refactory), and seven model architectures (including two pre-trained models CodeBERT and GraphCodeBERT). Experimental results demonstrate that MIXCODE outperforms the baseline data augmentation approach by up to 6.24% in accuracy and 26.06% in robustness.
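The mixing step described in the abstract can be illustrated with the standard Mixup interpolation, in which two inputs and their labels are combined linearly with a weight drawn from a Beta distribution. The following is a minimal sketch, not the authors' implementation: it assumes each program has already been encoded as a fixed-length feature vector and each label as a one-hot vector, and the function name `mixup`, the parameter `alpha`, and the toy vectors are hypothetical choices made for illustration.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Linearly interpolate two feature vectors and their one-hot labels.

    Assumption: x_a and x_b are equal-length numeric vectors (for example,
    embeddings of an original program and a refactored variant), and
    y_a and y_b are one-hot label vectors of equal length.
    """
    # Mixing weight lam in [0, 1], sampled from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Hypothetical embeddings of an original program and its refactored
    # variant; both carry the same class label.
    x_orig, x_refactored = [0.9, 0.1, 0.4], [0.7, 0.3, 0.5]
    label = [0.0, 1.0, 0.0]
    x_mix, y_mix, lam = mixup(x_orig, label, x_refactored, label, rng=rng)
    print("lambda:", lam)
    print("mixed features:", x_mix)
    print("mixed label:", y_mix)
```

When both inputs share a label, as with a program and its label-preserving refactoring, the mixed label equals the original label up to floating-point rounding and only the features are interpolated. When the labels differ, the result is a soft label whose entries sum to 1.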
Pages: 379-390
Page count: 12
Related papers
50 in total
  • [21] MixACM: Mixup-Based Robustness Transfer via Distillation of Activated Channel Maps
    Awais, Muhammad
    Zhou, Fengwei
    Xie, Chuanlong
    Li, Jiawei
    Bae, Sung-Ho
    Li, Zhenguo
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [22] Data augmentation with Mixup: Enhancing performance of a functional neuroimaging-based prognostic deep learning classifier in recent onset psychosis
    Smucny, Jason
    Shi, Ge
    Lesh, Tyler A.
    Carter, Cameron S.
    Davidson, Ian
    [J]. NEUROIMAGE-CLINICAL, 2022, 36
  • [23] MKD: Mixup-Based Knowledge Distillation for Mandarin End-to-End Speech Recognition
    Wu, Xing
    Jin, Yifan
    Wang, Jianjia
    Qian, Quan
    Guo, Yike
    [J]. ALGORITHMS, 2022, 15 (05)
  • [24] A New Data Augmentation Method Based on Mixup and Dempster-Shafer Theory
    Zhang, Zhuo
    Wang, Hongfei
    Geng, Jie
    Deng, Xinyang
    Jiang, Wen
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 4998 - 5013
  • [25] Attention mechanism and mixup data augmentation for classification of COVID-19 Computed Tomography images
    Ozdemir, Ozgur
    Sonmez, Elena Battini
    [J]. JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2022, 34 (08) : 6199 - 6207
  • [26] Mixup-Based Neural Network for Image Restoration and Structure Prediction From SEM Images
    Park, Junho
    Cho, Yubin
    Hwang, Yeieun
    Ma, Ami
    Kim, Qhwan
    Chang, Kyu-Baik
    Jeong, Jaehoon
    Kang, Suk-Ju
    [J]. IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2024, 73 : 1 - 16
  • [27] Feature Distribution-Based Medical Data Augmentation: Enhancing Mood Disorder Classification
    Yoo, Joo Hun
    An, Ji Hyun
    Chung, Tai-Myoung
    [J]. IEEE ACCESS, 2024, 12 : 127782 - 127791
  • [28] Data Augmentation Methods for Enhancing Robustness in Text Classification Tasks
    Tang, Huidong
    Kamei, Sayaka
    Morimoto, Yasuhiko
    [J]. ALGORITHMS, 2023, 16 (01)
  • [29] Enhancing Endoscopic Image Classification with Symptom Localization and Data Augmentation
    Trung-Hieu Hoang
    Hai-Dang Nguyen
    Viet-Anh Nguyen
    Thanh-An Nguyen
    Vinh-Tiep Nguyen
    Minh-Triet Tran
    [J]. PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 2578 - 2582
  • [30] K-mixup: Data augmentation for offline reinforcement learning using mixup in a Koopman invariant subspace
    Jang, Junwoo
    Han, Jungwoo
    Kim, Jinwhan
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 225