The Impact Study of Class Imbalance on the Performance of Software Defect Prediction Models

Authors
Yu Q. [1]
Jiang S.-J. [1,2]
Zhang Y.-M. [1,3]
Wang X.-Y. [1]
Gao P.-F. [1]
Qian J.-Y. [2]
Affiliations
[1] School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221116, Jiangsu
[2] Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, 541004, Guangxi
[3] State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing
Source
Qian, Jun-Yan (qjy2000@gmail.com) | Year: 2018 / Volume: Science Press / Issue: 41
Keywords
Class imbalance; Cost-sensitive learning; Ensemble learning; Imbalance ratio; Prediction models; Software defect prediction
DOI
10.11897/SP.J.1016.2018.00809
Abstract
Class imbalance means that the numbers of samples in different classes are unequal. In software defect prediction, the performance of traditional prediction models may be affected by class imbalance in the datasets. To explore this impact, this paper presents an approach to analyzing the effect of class imbalance on the performance of software defect prediction models. First, an algorithm is designed to construct new datasets; it converts an original imbalanced dataset into a set of new datasets whose imbalance ratios increase one by one. Second, different classification models are selected as defect prediction models and applied to each of the newly constructed datasets. The AUC metric is used to measure the classification performance of the prediction models. Finally, the coefficient of variation (C·V) is applied to evaluate the performance stability of each prediction model under class imbalance. The empirical study is conducted on eight typical prediction models. The results show that the performance of three prediction models, C4.5, RIPPER and SMO, decreases as the imbalance ratio increases. However, cost-sensitive learning and ensemble learning can improve their performance and performance stability under class imbalance. Compared with these three models, the performance of Logistic Regression, Naive Bayes and Random Forest is more stable. © 2018, Science Press. All rights reserved.
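The two mechanical steps named in the abstract, building datasets with stepwise increasing imbalance ratios and scoring stability with the coefficient of variation, can be sketched roughly as follows. This is an illustration only, not the paper's algorithm: the function names, the definition of imbalance ratio as non-defective count divided by defective count, the random undersampling of the non-defective class, and the use of the population standard deviation are all assumptions made here.

```python
import random
import statistics


def build_imbalanced_datasets(defective, clean, max_ratio=None, seed=0):
    """Return {ratio: (defective, sampled_clean)} for ratio = 1, 2, 3, ...

    Assumption: imbalance ratio = len(clean) / len(defective), and each new
    dataset keeps every defective sample while drawing ratio * len(defective)
    non-defective samples at random. The paper's construction may differ.
    """
    rng = random.Random(seed)
    top = len(clean) // len(defective)  # largest whole ratio the data supports
    if max_ratio is not None:
        top = min(top, max_ratio)
    datasets = {}
    for ratio in range(1, top + 1):
        sampled_clean = rng.sample(clean, ratio * len(defective))
        datasets[ratio] = (list(defective), sampled_clean)
    return datasets


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population form, an assumption)."""
    return statistics.pstdev(values) / statistics.mean(values)
```

Under these assumptions, a model would be trained and evaluated once per constructed dataset, and `coefficient_of_variation` would then be applied to the resulting list of AUC values; a smaller value would indicate performance that changes less as the imbalance ratio grows.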
Pages: 809-824
Page count: 15