A comparative study of handling imbalanced data using generative adversarial networks for machine learning based software fault prediction

Cited by: 0
Authors
Phuong, Ha Thi Minh [1]
Nguyet, Pham Vu Thu [1]
Minh, Nguyen Huu Nhat [1]
Hanh, Le Thi My [2]
Binh, Nguyen Thanh [1]
Affiliations
[1] Univ Danang, Vietnam Korea Univ Informat & Commun Technol, Da Nang 55000, Vietnam
[2] Univ Danang, Univ Sci & Technol, Da Nang 55000, Vietnam
Keywords
Data imbalance; Data sampling; Fault prediction; GANs; Optimization algorithm; Ensemble
DOI
10.1007/s10489-024-05930-z
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Software fault prediction (SFP) is the process of identifying potentially defect-prone modules before the testing stage of a software development process. By identifying faults early, software engineers can focus their effort on the components most likely to contain defects, thereby improving the overall quality and reliability of the software. However, data imbalance and feature redundancy are challenging issues in SFP that can degrade the performance of fault prediction models. Imbalanced software fault datasets, in which the number of normal modules (majority class) is significantly higher than that of faulty modules (minority class), may lead to many false negatives. In this work, we perform an empirical assessment of variants of Generative Adversarial Networks (GANs), an emerging synthetic data generation method, for resolving the data imbalance issue in common software fault prediction datasets. Five GAN variants (CopulaGAN, VanillaGAN, CTGAN, TGAN, and WGANGP) are used to generate synthetic faulty samples that balance the proportion of majority and minority classes in the datasets. We then present an extensive evaluation of prediction models that combine Recursive Feature Elimination (RFE) for feature selection with GAN oversampling methods, as well as models that pair Autoencoders for feature extraction with GANs. In experiments on five fault datasets extracted from the PROMISE repository, we evaluate six machine learning approaches using precision, recall, F1-score, Area Under Curve (AUC), and Matthews Correlation Coefficient (MCC) as performance metrics. The experimental results demonstrate that the combination of CTGAN with RFE and the pairing of CTGAN with Autoencoders outperform the other baselines on all datasets, followed by WGANGP and VanillaGAN. According to the comparative analysis, GAN-based oversampling methods exhibited significant improvement in dealing with data imbalance for software fault prediction.
Pages: 34
Related papers
50 in total
  • [41] Framework for imbalanced fault diagnosis of rolling bearing using autoencoding generative adversarial learning
    Rathore, Maan Singh
    Harsha, S. P.
    JOURNAL OF THE BRAZILIAN SOCIETY OF MECHANICAL SCIENCES AND ENGINEERING, 2023, 45 (01)
  • [42] Framework for imbalanced fault diagnosis of rolling bearing using autoencoding generative adversarial learning
    Maan Singh Rathore
    S. P. Harsha
    Journal of the Brazilian Society of Mechanical Sciences and Engineering, 2023, 45
  • [43] Generative adversarial network and transfer-learning-based fault detection for rotating machinery with imbalanced data condition
    Li, Jun
    Liu, Yongbao
    Li, Qijie
    MEASUREMENT SCIENCE AND TECHNOLOGY, 2022, 33 (04)
  • [44] Translation of MFL and UT data by using generative adversarial networks: A comparative study
    Ling, Jiatong
    Peng, Xiang
    Peussner, Matthias
    Siggers, Kevin
    Liu, Zheng
    NDT & E INTERNATIONAL, 2025, 149
  • [45] Improved generative adversarial network for vibration-based fault diagnosis with imbalanced data
    Zhao, Bingxi
    Yuan, Qi
    MEASUREMENT, 2021, 169 (169)
  • [46] Comparative Study on Defect Prediction Algorithms of Supervised Learning Software Based on Imbalanced Classification Data Sets
    Ge, Jianxin
    Liu, Jiaomin
    Liu, Wenyuan
    2018 19TH IEEE/ACIS INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING (SNPD), 2018: 399 - 406
  • [47] CMAGAN: classifier-aided minority augmentation generative adversarial networks for industrial imbalanced data and its application to fault prediction
    Wang, Wen-Jie
    Liu, Zhao
    Zhu, Ping
    ADVANCES IN MANUFACTURING, 2024, 12 (03) : 603 - 618
  • [48] Generating Energy Data for Machine Learning with Recurrent Generative Adversarial Networks
    Fekri, Mohammad Navid
    Ghosh, Ananda Mohon
    Grolinger, Katarina
    ENERGIES, 2020, 13 (01)
  • [49] An Imbalanced Data Handling Framework for Industrial Big Data Using a Gaussian Process Regression-Based Generative Adversarial Network
    Oh, Eunseo
    Lee, Hyunsoo
    SYMMETRY-BASEL, 2020, 12 (04)
  • [50] Imbalanced Fault Diagnosis of Rotating Machinery Based on Deep Generative Adversarial Networks with Gradient Penalty
    Luo, Junqi
    Zhu, Liucun
    Li, Quanfang
    Liu, Daopeng
    Chen, Mingyou
    PROCESSES, 2021, 9 (10)