MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation

Cited by: 6

Authors:

Dong, Zeming [1]

Hu, Qiang [2]

Guo, Yuejun [3]

Cordy, Maxime [2]

Papadakis, Mike [2]

Zhang, Zhenya [1]

Le Traon, Yves [2]

Zhao, Jianjun [1]

Affiliations:

[1] Kyushu Univ, Fukuoka, Japan

[2] Univ Luxembourg, Luxembourg, Luxembourg

[3] Luxembourg Inst Sci & Technol, Luxembourg, Luxembourg

Source:

2023 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ANALYSIS, EVOLUTION AND REENGINEERING, SANER | 2023

Keywords:

Data augmentation; Mixup; Source code analysis;

DOI:

10.1109/SANER56733.2023.00043

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification codes:

081202 ; 0835 ;

Abstract:

Inspired by the great success of Deep Neural Networks (DNNs) in natural language processing (NLP), DNNs have been increasingly applied in source code analysis and attracted significant attention from the software engineering community. Due to its data-driven nature, a DNN model requires massive and high-quality labeled training data to achieve expert-level performance. Collecting such data is often not hard, but the labeling process is notoriously laborious. The task of DNN-based code analysis even worsens the situation because source code labeling also demands sophisticated expertise. Data augmentation has been a popular approach to supplement training data in domains such as computer vision and NLP. However, existing data augmentation approaches in code analysis adopt simple methods, such as data transformation and adversarial example generation, thus bringing limited performance superiority. In this paper, we propose a data augmentation approach MIXCODE that aims to effectively supplement valid training data, inspired by the recent advance named Mixup in computer vision. Specifically, we first utilize multiple code refactoring methods to generate transformed code that holds consistent labels with the original data. Then, we adapt the Mixup technique to mix the original code with the transformed code to augment the training data. We evaluate MIXCODE on two programming languages (Java and Python), two code tasks (problem classification and bug detection), four benchmark datasets (JAVA250, Python800, CodRepl, and Refactory), and seven model architectures (including two pre-trained models CodeBERT and GraphCodeBERT). Experimental results demonstrate that MIXCODE outperforms the baseline data augmentation approach by up to 6.24% in accuracy and 26.06% in robustness.
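The mixing step described in the abstract can be illustrated with a minimal sketch. It assumes that each program has already been encoded as a fixed-length vector and that labels are one-hot; the abstract does not state at which representation level MIXCODE mixes, so this is an illustration of generic Mixup, not the authors' implementation. The function name `mixup_pair` and the parameter `alpha` are hypothetical.

```python
import numpy as np


def mixup_pair(x_orig, x_trans, y_orig, y_trans, alpha=0.2, rng=None):
    """Linearly mix original and transformed samples (generic Mixup).

    x_orig, x_trans: arrays of shape (batch, dim), assumed to be vector
        encodings of the original and the refactored code.
    y_orig, y_trans: one-hot label arrays of shape (batch, classes).
    alpha: Beta distribution parameter (hypothetical default).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Mixing coefficient drawn from Beta(alpha, alpha), as in standard Mixup.
    lam = rng.beta(alpha, alpha)
    # Convex combination of inputs and of labels with the same coefficient.
    x_mix = lam * x_orig + (1.0 - lam) * x_trans
    y_mix = lam * y_orig + (1.0 - lam) * y_trans
    return x_mix, y_mix, lam


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_o = rng.normal(size=(4, 8))            # stand-in for original code vectors
    x_t = rng.normal(size=(4, 8))            # stand-in for refactored code vectors
    y = np.eye(3)[[0, 1, 2, 0]]              # refactoring keeps the label unchanged
    x_mix, y_mix, lam = mixup_pair(x_o, x_t, y, y, rng=rng)
    print(x_mix.shape, y_mix.shape, lam)
```

Because a refactored program keeps the label of its original, mixing a sample with its own transformation leaves the label unchanged; the label interpolation only matters if samples with different labels are paired.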


Pages: 379-390

Page count: 12

