A comparative study of handling imbalanced data using generative adversarial networks for machine learning based software fault prediction

Cited by: 0
Authors
Phuong, Ha Thi Minh [1]
Nguyet, Pham Vu Thu [1]
Minh, Nguyen Huu Nhat [1]
Hanh, Le Thi My [2]
Binh, Nguyen Thanh [1]
Affiliations
[1] Univ Danang, Vietnam Korea Univ Informat & Commun Technol, Da Nang 55000, Vietnam
[2] Univ Danang, Univ Sci & Technol, Da Nang 55000, Vietnam
Keywords
Data imbalance; Data sampling; Fault prediction; GANs; Optimization algorithm; Ensemble
DOI
10.1007/s10489-024-05930-z
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Software fault prediction (SFP) is the process of identifying potentially defect-prone modules before the testing stage of software development. By identifying faults early, software engineers can focus their effort on the components most likely to contain defects, thereby improving the overall quality and reliability of the software. However, data imbalance and feature redundancy are challenging issues in SFP that can degrade the performance of fault prediction models. Imbalanced software fault datasets, in which the number of normal modules (majority class) is significantly higher than the number of faulty modules (minority class), may lead to many false negatives. In this work, we empirically assess variants of Generative Adversarial Networks (GANs), an emerging synthetic data generation method, for resolving the data imbalance issue in common software fault prediction datasets. Five GAN variants (CopulaGAN, VanillaGAN, CTGAN, TGAN and WGANGP) are used to generate synthetic faulty samples that balance the proportion of majority and minority classes in the datasets. We then present an extensive evaluation of prediction models that combine Recursive Feature Elimination (RFE) for feature selection with GAN oversampling methods, as well as models that pair Autoencoders for feature extraction with GANs. In experiments on five fault datasets extracted from the PROMISE repository, we evaluate six machine learning approaches using precision, recall, F1-score, Area Under the Curve (AUC) and Matthews Correlation Coefficient (MCC) as performance metrics. The experimental results demonstrate that the combination of CTGAN with RFE and the pairing of CTGAN with Autoencoders outperform the other baselines on all datasets, followed by WGANGP and VanillaGAN. According to the comparative analysis, GAN-based oversampling methods yield significant improvement in dealing with data imbalance for software fault prediction.
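To illustrate why the abstract emphasizes false negatives and reports MCC rather than accuracy alone, the following is a minimal pure-Python sketch, not the authors' code. The toy labels (90 normal, 10 faulty modules), the "always predict normal" baseline, and every function name are assumptions made for illustration. AUC is omitted because it requires ranked prediction scores rather than hard labels.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 (faulty) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def fault_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and MCC from hard labels."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }

def synthetic_samples_needed(y_true):
    """Synthetic faulty samples an oversampler must add to reach a 1:1 ratio."""
    n_faulty = sum(y_true)
    n_normal = len(y_true) - n_faulty
    return max(0, n_normal - n_faulty)

# Hypothetical imbalanced dataset: 90 normal (0) and 10 faulty (1) modules.
y_true = [0] * 90 + [1] * 10
# Degenerate baseline that labels every module as normal.
always_normal = [0] * len(y_true)

metrics = fault_metrics(y_true, always_normal)
needed = synthetic_samples_needed(y_true)
print(metrics)
print("synthetic faulty samples needed for balance:", needed)
```

On this toy data the baseline scores high accuracy while recall, F1 and MCC are all zero, because every faulty module is a false negative. An oversampler, GAN-based or otherwise, would need to contribute 80 synthetic faulty samples to balance the two classes.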
Pages: 34