Software Defect Prediction with Skewed Data

Cited by: 0
Authors
Seliya, Naeem [1]
Khoshgoftaar, Taghi M. [2]
Affiliations
[1] Univ Michigan, 4901 Evergreen Rd, Dearborn, MI 48128 USA
[2] Florida Atlantic Univ, Comp & Elect Engn & Comp Sci, Boca Raton, FL 33431 USA
Keywords
defect prediction; software metrics; skewed data; machine learning; data sampling; boosting; classification
DOI
Not available
Chinese Library Classification number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Software defect prediction is often employed as a cost-effective way to focus quality-improvement resources on poor-quality program modules. Many software measurement and defect data sets are highly skewed: the proportion of not-fault-prone (majority class) modules is substantially larger than that of fault-prone (minority class) modules. Data sampling and boosting (Boost) are useful techniques for alleviating this problem. While various data sampling techniques are available, our prior studies have shown random undersampling (RUS) to be effective. RUS works by randomly removing instances from the majority class of the training data. This paper investigates combining RUS with boosting (RUSBoost) for software defect prediction with skewed software measurement and defect data sets. We consider two variations of RUSBoost that differ in the percentage of fault-prone modules desired in the post-sampling training data set: RUSBoost_A targets 35% and RUSBoost_B targets 50%. A case study of 15 data sets from multiple real-world projects demonstrates that RUSBoost performs significantly better than a model built without any data sampling or boosting technique. Moreover, RUSBoost_B significantly outperforms RUSBoost_A.
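The sampling step described in the abstract can be illustrated with a minimal sketch. It covers only random undersampling to a target fault-prone percentage, not the boosting stage, and it is not the authors' implementation. The function name `random_undersample`, its parameters, and the 1,000-module toy data set are assumptions made for illustration.

```python
import random


def random_undersample(labels, minority_fraction, minority_label=1, seed=0):
    """Return sorted indices kept after randomly dropping majority instances.

    Hypothetical helper: `minority_fraction` is the desired share of
    fault-prone (minority) modules after sampling, e.g. 0.35 or 0.50.
    """
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Majority count n_maj such that n_min / (n_min + n_maj) equals the target.
    target_majority = round(
        len(minority) * (1 - minority_fraction) / minority_fraction
    )
    # Never ask for more majority instances than exist.
    kept_majority = rng.sample(majority, min(target_majority, len(majority)))
    return sorted(minority + kept_majority)


# Toy data (assumed): 70 fault-prone modules (1) and 930 not-fault-prone (0).
labels = [1] * 70 + [0] * 930
kept_a = random_undersample(labels, 0.35)  # 35% target, as in RUSBoost_A
kept_b = random_undersample(labels, 0.50)  # 50% target, as in RUSBoost_B
```

All minority instances are retained in both cases; only the number of surviving majority instances changes with the target percentage, so the 50% setting discards more of the training data than the 35% setting.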
Pages: 403 / +
Page count: 2
Related papers
50 in total
  • [1] Aggregating Data Sampling with Feature Subset Selection to Address Skewed Software Defect Data
    Gao, Kehan
    Khoshgoftaar, Taghi M.
    Napolitano, Amri
    [J]. INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2015, 25 (9-10) : 1531 - 1550
  • [2] Deep Learning Experiments with Skewed Data for Defect Prediction in Plastic Injection Molding
    Kim, Seongwoo
    Kim, Seyoung
    Ryu, Kwang Ryel
    [J]. 2018 IEEE/ACS 15TH INTERNATIONAL CONFERENCE ON COMPUTER SYSTEMS AND APPLICATIONS (AICCSA), 2018,
  • [3] A novel under sampling strategy for efficient software defect analysis of skewed distributed data
    K. Nitalaksheswara Rao
    Ch. Satyananda Reddy
    [J]. Evolving Systems, 2020, 11 : 119 - 131
  • [4] A novel under sampling strategy for efficient software defect analysis of skewed distributed data
    Rao, K. Nitalaksheswara
    Reddy, Ch. Satyananda
    [J]. EVOLVING SYSTEMS, 2020, 11 (01) : 119 - 131
  • [5] Early Software Defect Prediction: Right-Shifting Software Effort Data into a Defect Curve
    Okumoto, Kazuhira
    [J]. 2022 IEEE INTERNATIONAL SYMPOSIUM ON SOFTWARE RELIABILITY ENGINEERING WORKSHOPS (ISSREW 2022), 2022, : 43 - 48
  • [6] Software Defect Prediction Based on Stability Test Data
    Okumoto, Kazu
    [J]. 2011 INTERNATIONAL CONFERENCE ON QUALITY, RELIABILITY, RISK, MAINTENANCE, AND SAFETY ENGINEERING (ICQR2MSE), 2011, : 385 - 387
  • [7] Feature Selection with Imbalanced Data for Software Defect Prediction
    Khoshgoftaar, Taghi M.
    Gao, Kehan
    [J]. EIGHTH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS, PROCEEDINGS, 2009, : 235 - +
  • [8] A Systematic Data Collection Procedure for Software Defect Prediction
    Mausa, Goran
    Grbac, Tihana Galinac
    Basic, Bojana Dalbelo
    [J]. COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2016, 13 (01) : 173 - 197
  • [9] Imbalanced Data Processing Model for Software Defect Prediction
    Lijuan Zhou
    Ran Li
    Shudong Zhang
    Hua Wang
    [J]. Wireless Personal Communications, 2018, 102 : 937 - 950
  • [10] Imbalanced Data Processing Model for Software Defect Prediction
    Zhou, Lijuan
    Li, Ran
    Zhang, Shudong
    Wang, Hua
    [J]. WIRELESS PERSONAL COMMUNICATIONS, 2018, 102 (02) : 937 - 950