The Impact Study of Class Imbalance on the Performance of Software Defect Prediction Models

Authors
Yu Q. [1]
Jiang S.-J. [1, 2]
Zhang Y.-M. [1, 3]
Wang X.-Y. [1]
Gao P.-F. [1]
Qian J.-Y. [2]
Affiliations
[1] School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221116, Jiangsu
[2] Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, 541004, Guangxi
[3] State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing
Source
Qian, Jun-Yan (qjy2000@gmail.com) | 2018 / Science Press / 41
Keywords
Class imbalance; Cost-sensitive learning; Ensemble learning; Imbalance ratio; Prediction models; Software defect prediction
DOI
10.11897/SP.J.1016.2018.00809
Abstract
Class imbalance refers to a situation in which the number of samples differs substantially across classes. In software defect prediction, the performance of traditional prediction models may be affected by class imbalance in the datasets. To explore this effect, this paper presents an approach to analyzing the impact of class imbalance on the performance of software defect prediction models. First, an algorithm is designed to construct new datasets: it converts an original imbalanced dataset into a set of new datasets whose imbalance ratios increase step by step. Second, different classification models are selected as defect prediction models and applied to each of the newly constructed datasets. The AUC metric is used to measure the classification performance of each prediction model. Finally, the coefficient of variation (C·V) is applied to evaluate the performance stability of each prediction model under class imbalance. The empirical study is conducted on eight typical prediction models. The results show that the performance of three prediction models, C4.5, RIPPER and SMO, decreases as the imbalance ratio increases. However, cost-sensitive learning and ensemble learning can improve both their performance and their performance stability under class imbalance. Compared with these three models, the Logistic Regression, Naive Bayes and Random Forest models perform more stably. © 2018, Science Press. All rights reserved.
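The two mechanical steps named in the abstract, building datasets with increasing imbalance ratio and scoring stability with the coefficient of variation, can be illustrated with a minimal Python sketch. This is not the paper's algorithm, which the abstract does not specify. The sketch assumes that the imbalance ratio is the number of non-defective samples divided by the number of defective samples, that higher ratios are reached by randomly discarding defective samples, and that C·V is the population standard deviation divided by the mean, expressed as a percentage. The function names `build_imbalanced_sets` and `coefficient_of_variation` are hypothetical.

```python
import random
import statistics


def build_imbalanced_sets(defective, clean, ratios, seed=0):
    """Return {ratio: dataset} with roughly `ratio` clean samples per defective one.

    Assumption (not from the paper): every clean sample is kept and the
    defective class is randomly subsampled to reach each target ratio.
    """
    rng = random.Random(seed)
    datasets = {}
    for ratio in ratios:
        n_defective = max(1, round(len(clean) / ratio))
        if n_defective > len(defective):
            # The target ratio is lower than the original one; it cannot be
            # reached by discarding defective samples, so skip it.
            continue
        datasets[ratio] = rng.sample(defective, n_defective) + list(clean)
    return datasets


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean, as a percentage."""
    return statistics.pstdev(values) / statistics.mean(values) * 100


# Toy samples for illustration only; these are not data from the paper.
defective = [("defective", i) for i in range(20)]
clean = [("clean", i) for i in range(100)]
sets_by_ratio = build_imbalanced_sets(defective, clean, ratios=[5, 10, 20])
```

In a study of this shape, one would train each prediction model on every dataset in `sets_by_ratio`, record one AUC value per ratio, and pass that list to `coefficient_of_variation`; a lower value would indicate a model whose performance changes less as the imbalance ratio grows.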
Pages: 809-824
Page count: 15