A comparative study of handling imbalanced data using generative adversarial networks for machine learning based software fault prediction

Cited by: 0
Authors
Phuong, Ha Thi Minh [1]
Nguyet, Pham Vu Thu [1]
Minh, Nguyen Huu Nhat [1]
Hanh, Le Thi My [2]
Binh, Nguyen Thanh [1]
Affiliations
[1] Univ Danang, Vietnam Korea Univ Informat & Commun Technol, Da Nang 55000, Vietnam
[2] Univ Danang, Univ Sci & Technol, Da Nang 55000, Vietnam
Keywords
Data imbalance; Data sampling; Fault prediction; GANs; Optimization algorithm; Ensemble
DOI
10.1007/s10489-024-05930-z
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Software fault prediction (SFP) is the process of identifying potentially defect-prone modules before the testing stage of a software development process. By identifying faults early in development, software engineers can focus their efforts on the components most likely to contain defects, thereby improving the overall quality and reliability of the software. However, data imbalance and feature redundancy are challenging issues in SFP that can degrade the performance of fault prediction models. Imbalanced software fault datasets, in which the number of normal modules (majority class) is significantly higher than that of faulty modules (minority class), may lead to many false negative results. In this work, we empirically assess variants of Generative Adversarial Networks (GANs), an emerging synthetic data generation method, for resolving the data imbalance issue in common software fault prediction datasets. Five GAN variants (CopulaGAN, VanillaGAN, CTGAN, TGAN and WGANGP) are used to generate synthetic faulty samples that balance the proportion of the majority and minority classes in the datasets. Thereafter, we present an extensive evaluation of prediction models that combine Recursive Feature Elimination (RFE) for feature selection with GAN oversampling methods, and of models that pair Autoencoders for feature extraction with GANs. In experiments on five fault datasets extracted from the PROMISE repository, we evaluate six machine learning approaches using precision, recall, F1-score, Area Under Curve (AUC) and Matthews Correlation Coefficient (MCC) as performance metrics. The experimental results demonstrate that the combination of CTGAN with RFE and the pairing of CTGAN with Autoencoders outperform the other baselines on all datasets, followed by WGANGP and VanillaGAN. According to the comparative analysis, GAN-based oversampling methods exhibited significant improvement in dealing with data imbalance for software fault prediction.
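As a rough illustration of why the abstract reports precision, recall, F1-score and MCC rather than plain accuracy, the sketch below computes these metrics from a confusion matrix on a toy imbalanced label set. It is not the authors' code: the function names (`confusion_counts`, `fault_metrics`, `synthetic_samples_needed`) and the 90/10 toy data are hypothetical, the label 1 is assumed to mean "faulty", and AUC is omitted because it needs predicted scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN, treating label 1 (faulty) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def fault_metrics(y_true, y_pred):
    """Precision, recall, F1 and MCC for a binary fault predictor."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


def synthetic_samples_needed(n_majority, n_minority):
    """Minority samples an oversampler must add to reach a 1:1 class ratio."""
    return max(n_majority - n_minority, 0)


# Hypothetical toy data: 90 normal modules (0) and 10 faulty modules (1).
y_true = [0] * 90 + [1] * 10

# A degenerate model that always predicts "normal": 90% accurate, yet useless.
always_normal = [0] * 100

# A hypothetical model that finds 6 of the 10 faults with 2 false alarms.
some_model = [0] * 88 + [1] * 2 + [1] * 6 + [0] * 4

print(fault_metrics(y_true, always_normal))
print(fault_metrics(y_true, some_model))
print(synthetic_samples_needed(90, 10))
```

The always-normal predictor scores zero recall and zero MCC despite its high accuracy, which is the false-negative problem the abstract attributes to imbalanced datasets. The last helper only counts how many synthetic faulty samples a generator would need to supply; producing them is the job of the GAN variants the study compares.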
Pages: 34
Related papers
50 in total
  • [31] Comparative study of three machine learning methods for software fault prediction
    Wang, Qi
    Zhu, Jie
    Yu, Bo
    Journal of Shanghai Jiaotong University (Science), 2005, 10 E (02) : 117 - 121
  • [32] A Comparative Study of Three Machine Learning Methods for Software Fault Prediction
    王琪
    朱杰
    于波
    Journal of Shanghai Jiaotong University, 2005, (02) : 117 - 121
  • [33] Imbalanced Fault Diagnosis of Rolling Bearing Using Data Synthesis Based on Multi-Resolution Fusion Generative Adversarial Networks
    Hao, Chuanzhu
    Du, Junrong
    Liang, Haoran
    MACHINES, 2022, 10 (05)
  • [34] Data synthesis using dual discriminator conditional generative adversarial networks for imbalanced fault diagnosis of rolling bearings
    Zheng, Taisheng
    Song, Lei
    Wang, Jianxing
    Teng, Wei
    Xu, Xiaoli
    Ma, Chao
    MEASUREMENT, 2020, 158
  • [35] Imbalanced Fault Diagnosis Using Conditional Wasserstein Generative Adversarial Networks With Switchable Normalization
    Fu, Wenlong
    Chen, Yupeng
    Li, Hongyan
    Chen, Xiaoyue
    Chen, Baojia
    IEEE SENSORS JOURNAL, 2023, 23 (23) : 29119 - 29130
  • [36] Data synthesis using deep feature enhanced generative adversarial networks for rolling bearing imbalanced fault diagnosis
    Liu, Shaowei
    Jiang, Hongkai
    Wu, Zhenghong
    Li, Xingqiu
    MECHANICAL SYSTEMS AND SIGNAL PROCESSING, 2022, 163
  • [37] A Novel Method for Imbalanced Fault Diagnosis of Rotating Machinery Based on Generative Adversarial Networks
    Li, Zhenxiang
    Zheng, Taisheng
    Wang, Yang
    Cao, Zhi
    Guo, Zhiqi
    Fu, Hongyong
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2021, 70
  • [38] Leveraging generative adversarial networks for data augmentation to improve fault detection in wind turbines with imbalanced data
    Chatterjee, Subhajit
    Byun, Yung-Cheol
    RESULTS IN ENGINEERING, 2025, 25
  • [39] Learning from class-imbalanced data using misclassification-focusing generative adversarial networks
    Yun, Jaesub
    Lee, Jong-Seok
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 240
  • [40] A joint learning method for incomplete and imbalanced data in electronic health record based on generative adversarial networks
    Weng, Xutao
    Song, Hong
    Lin, Yucong
    Wu, You
    Zhang, Xi
    Liu, Bowen
    Yang, Jian
    COMPUTERS IN BIOLOGY AND MEDICINE, 2024, 168