A comparative study of handling imbalanced data using generative adversarial networks for machine learning based software fault prediction

Cited by: 0
Authors
Phuong, Ha Thi Minh [1]
Nguyet, Pham Vu Thu [1]
Minh, Nguyen Huu Nhat [1]
Hanh, Le Thi My [2]
Binh, Nguyen Thanh [1]
Affiliations
[1] Univ Danang, Vietnam Korea Univ Informat & Commun Technol, Da Nang 55000, Vietnam
[2] Univ Danang, Univ Sci & Technol, Da Nang 55000, Vietnam
Keywords
Data imbalance; Data sampling; Fault prediction; GANs; Optimization algorithm; Ensemble
DOI
10.1007/s10489-024-05930-z
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Software fault prediction (SFP) is the process of identifying potentially defect-prone modules before the testing stage of software development. By identifying faults early, software engineers can focus their effort on the components most likely to contain defects, thereby improving the overall quality and reliability of the software. However, data imbalance and feature redundancy are challenging issues in SFP that can degrade the performance of fault prediction models. Imbalanced software fault datasets, in which normal modules (the majority class) greatly outnumber faulty modules (the minority class), may lead to many false negatives. In this work, we empirically assess variants of Generative Adversarial Networks (GANs), an emerging synthetic data generation method, for resolving the data imbalance issue in common software fault prediction datasets. Five GAN variants (CopulaGAN, VanillaGAN, CTGAN, TGAN, and WGANGP) are used to generate synthetic faulty samples that balance the majority and minority classes in the datasets. We then present an extensive evaluation of prediction models that combine Recursive Feature Elimination (RFE) for feature selection with GAN oversampling, and of models that pair Autoencoders for feature extraction with GANs. In experiments on five fault datasets extracted from the PROMISE repository, we evaluate six machine learning approaches using precision, recall, F1-score, Area Under Curve (AUC), and Matthews Correlation Coefficient (MCC) as performance metrics. The experimental results demonstrate that CTGAN combined with RFE and CTGAN paired with Autoencoders outperform the other baselines on all datasets, followed by WGANGP and VanillaGAN. According to the comparative analysis, GAN-based oversampling methods yield a significant improvement in handling data imbalance for software fault prediction.
Pages: 34
Related papers
50 in total
  • [11] Detection of attacks on software defined networks using machine learning techniques and imbalanced data handling methods
    Hassan, Heba A.
    Hemdan, Ezz El-Din
    El-Shafai, Walid
    Shokair, Mona
    Abd El-Samie, Fathi E.
    SECURITY AND PRIVACY, 2024, 7 (02)
  • [12] Review of imbalanced fault diagnosis technology based on generative adversarial networks
    Chen, Hualin
    Wei, Jianan
    Huang, Haisong
    Yuan, Yage
    Wang, Jiaxin
    JOURNAL OF COMPUTATIONAL DESIGN AND ENGINEERING, 2024, 11 (05) : 99 - 124
  • [13] Research on imbalanced learning based on conditional generative adversarial networks
    Zhao H.-X.
    Shi H.-B.
    Wu J.
    Chen X.
    Kongzhi yu Juece/Control and Decision, 2021, 36 (03) : 619 - 628
  • [14] A Novel Method for Fault Diagnosis of Bearings with Small and Imbalanced Data Based on Generative Adversarial Networks
    Tong, Qingbin
    Lu, Feiyu
    Feng, Ziwei
    Wan, Qingzhu
    An, Guoping
    Cao, Junci
    Guo, Tao
    APPLIED SCIENCES-BASEL, 2022, 12 (14)
  • [15] Fault diagnosis method for imbalanced data based on adaptive diffusion models and generative adversarial networks
    Li, Xueyi
    Wu, Xudong
    Wang, Tianyang
    Xie, Yining
    Chu, Fulei
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2025, 147
  • [16] Machine learning data center workloads using generative adversarial networks
    Haverkort B.R.
    Finkbeiner F.
    De Boer P.-T.
    1600, Association for Computing Machinery (48) : 21 - 23
  • [17] An imbalanced data learning method for tool breakage detection based on generative adversarial networks
    Shixu Sun
    Xiaofeng Hu
    Yingchao Liu
    Journal of Intelligent Manufacturing, 2022, 33 : 2441 - 2455
  • [18] An imbalanced data learning method for tool breakage detection based on generative adversarial networks
    Sun, Shixu
    Hu, Xiaofeng
    Liu, Yingchao
    JOURNAL OF INTELLIGENT MANUFACTURING, 2022, 33 (08) : 2441 - 2455
  • [19] Enhanced generative adversarial networks for fault diagnosis of rotating machinery with imbalanced data
    Li, Qi
    Chen, Liang
    Shen, Changqing
    Yang, Bingru
    Zhu, Zhongkui
    MEASUREMENT SCIENCE AND TECHNOLOGY, 2019, 30 (11)
  • [20] Generative adversarial networks for data augmentation in machine fault diagnosis
    Shao, Siyu
    Wang, Pu
    Yan, Ruqiang
    COMPUTERS IN INDUSTRY, 2019, 106 : 85 - 93