A comparative study of handling imbalanced data using generative adversarial networks for machine learning based software fault prediction

Cited by: 0
Authors
Phuong, Ha Thi Minh [1]
Nguyet, Pham Vu Thu [1]
Minh, Nguyen Huu Nhat [1]
Hanh, Le Thi My [2]
Binh, Nguyen Thanh [1]
Affiliations
[1] Univ Danang, Vietnam Korea Univ Informat & Commun Technol, Da Nang 55000, Vietnam
[2] Univ Danang, Univ Sci & Technol, Da Nang 55000, Vietnam
Keywords
Data imbalance; Data sampling; Fault prediction; GANs; Optimization algorithm; Ensemble
DOI
10.1007/s10489-024-05930-z
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Software fault prediction (SFP) is the process of identifying potentially defect-prone modules before the testing stage of a software development process. By identifying faults early, software engineers can focus their effort on the components most likely to contain defects, thereby improving the overall quality and reliability of the software. However, data imbalance and feature redundancy are challenging issues in SFP that can degrade the performance of fault prediction models. Imbalanced software fault datasets, in which normal modules (the majority class) significantly outnumber faulty modules (the minority class), may lead to many false negatives. In this work, we empirically assess variants of Generative Adversarial Networks (GANs), an emerging synthetic data generation method, for resolving the data imbalance issue in common software fault prediction datasets. Five GAN variants (CopulaGAN, VanillaGAN, CTGAN, TGAN, and WGANGP) are used to generate synthetic faulty samples that balance the proportion of the majority and minority classes in the datasets. We then present an extensive evaluation of prediction models that combine Recursive Feature Elimination (RFE) for feature selection with GAN-based oversampling, as well as models that pair Autoencoders for feature extraction with GANs. In experiments on five fault datasets extracted from the PROMISE repository, we evaluate six machine learning approaches using precision, recall, F1-score, Area Under Curve (AUC), and Matthews Correlation Coefficient (MCC) as performance metrics. The experimental results demonstrate that the combination of CTGAN with RFE and the pairing of CTGAN with Autoencoders outperform the other baselines on all datasets, followed by WGANGP and VanillaGAN. According to the comparative analysis, GAN-based oversampling methods exhibited significant improvement in dealing with data imbalance for software fault prediction.
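The following is a minimal illustrative sketch, not the authors' code, of two ideas the abstract relies on: how many synthetic faulty samples an oversampler would need to generate, and how four of the named metrics are computed from a confusion matrix. The function names, the label convention (1 = faulty, 0 = normal), and the 1:1 balancing target are assumptions made for illustration; the abstract does not state the exact target ratio. AUC is omitted because it needs predicted scores rather than hard labels.

```python
import math


def synthetic_needed(labels):
    """Synthetic faulty samples required to reach a 1:1 class ratio.

    Assumption: 1 marks a faulty module, 0 a normal one, and the
    balancing target is equal class counts.
    """
    n_faulty = sum(1 for v in labels if v == 1)
    n_normal = len(labels) - n_faulty
    return max(n_normal - n_faulty, 0)


def _safe_div(num, den):
    # Return 0.0 when a metric is undefined (zero denominator).
    return num / den if den else 0.0


def fault_metrics(y_true, y_pred):
    """Precision, recall, F1 and MCC with the faulty class as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)
    f1 = _safe_div(2 * precision * recall, precision + recall)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _safe_div(tp * tn - fp * fn, mcc_den)
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}
```

For a hypothetical dataset with 90 normal and 10 faulty modules, `synthetic_needed` returns 80. A model that predicts "normal" for every module is 90% accurate yet scores 0 on recall and MCC, which shows why the study reports these metrics instead of accuracy alone.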
Pages: 34
Related papers
50 in total
  • [21] Imbalanced Fault Diagnosis of Rolling Bearing Using Enhanced Generative Adversarial Networks
    Zhang, Hongliang
    Wang, Rui
    Pan, Ruilin
    Pan, Haiyang
    IEEE ACCESS, 2020, 8 : 185950 - 185963
  • [23] An Intelligent Fault Diagnosis Method for Imbalanced Nuclear Power Plant Data Based on Generative Adversarial Networks
    Dai, Yuntao
    Peng, Lizhang
    Juan, Zhaobo
    Liang, Yuan
    Shen, Jihong
    Wang, Shujuan
    Tan, Sichao
    Yu, Hongyan
    Sun, Mingze
    JOURNAL OF ELECTRICAL ENGINEERING & TECHNOLOGY, 2023, 18 (04) : 3237 - 3252
  • [24] Data augment method for machine fault diagnosis using conditional generative adversarial networks
    Wang, Jinrui
    Han, Baokun
    Bao, Huaiqian
    Wang, Mingyan
    Chu, Zhenyun
    Shen, Yuwei
    PROCEEDINGS OF THE INSTITUTION OF MECHANICAL ENGINEERS PART D-JOURNAL OF AUTOMOBILE ENGINEERING, 2020, 234 (12) : 2719 - 2727
  • [25] Fault Diagnosis of Harmonic Drive With Imbalanced Data Using Generative Adversarial Network
    Yang, Guo
    Zhong, Yong
    Yang, Lie
    Tao, Hui
    Li, Jianying
    Du, Ruxu
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2021, 70
  • [26] Fault diagnosis method based on triple generative adversarial nets for imbalanced data
    Su, Changwei
    Wang, Xueren
    Liu, Ruijie
    Guo, Ziyi
    Sang, Shengtian
    Yu, Shuang
    Zhang, Haifeng
    MEASUREMENT SCIENCE AND TECHNOLOGY, 2023, 34 (03)
  • [27] Application of tabular data synthesis using generative adversarial networks on machine learning-based multiaxial fatigue life prediction
    He, GaoYuan
    Zhao, YongXiang
    Yan, ChuLiang
    INTERNATIONAL JOURNAL OF PRESSURE VESSELS AND PIPING, 2022, 199
  • [28] Generalization of Deep Neural Networks for Imbalanced Fault Classification of Machinery Using Generative Adversarial Networks
    Wang, Jinrui
    Li, Shunming
    Han, Baokun
    An, Zenghui
    Bao, Huaiqian
    Ji, Shanshan
    IEEE ACCESS, 2019, 7 : 111168 - 111180
  • [29] Generative adversarial networks with Gramian angular field for handling imbalanced data in specific emitter identification
    Zhang, Yezhuo
    Zhou, Zinan
    Li, Xuanpeng
    SIGNAL IMAGE AND VIDEO PROCESSING, 2024, 18 (03) : 2929 - 2938