A Comprehensive Investigation of the Impact of Class Overlap on Software Defect Prediction

Cited by: 17
Authors
Gong, Lina [1]
Zhang, Haoxiang [2]
Zhang, Jingxuan [1]
Wei, Mingqiang [1]
Huang, Zhiqiu [1]
Affiliations
[1] Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing 210095, Jiangsu, Peoples R China
[2] Queens Univ, Sch Comp, Software Anal & Intelligence Lab SAIL, Kingston, ON K7L 3N6, Canada
Funding
National Natural Science Foundation of China
Keywords
Class overlap; data quality; k-nearest neighbourhood; local analysis; software defect prediction; software metrics; FALSE DISCOVERY RATE; CLASSIFICATION; CLASSIFIERS; MACHINE; ERROR;
DOI
10.1109/TSE.2022.3220740
Chinese Library Classification
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Software Defect Prediction (SDP) is one of the most vital and cost-efficient activities for ensuring software quality. However, SDP datasets exhibit class overlap (i.e., defective and non-defective modules have similar metric values), which hinders both the performance and the use of SDP models. Although efforts have been made to investigate the impact of overlap-removal techniques on SDP performance, many open issues remain unresolved. Therefore, we conduct an empirical study to comprehensively investigate the impact of class overlap on SDP. Specifically, we first propose an approach that identifies overlapping instances by analyzing the class distribution in the local neighborhood of a given instance. We then investigate the impact of class overlap, and of two common techniques for handling overlapping instances, on the performance and the interpretation of seven representative SDP models. Through an extensive case study on 230 diverse datasets, we observe that: i) 70.0% of SDP datasets contain overlapping instances; ii) different levels of class overlap have different impacts on the performance of SDP models; iii) class overlap affects the ranking of important features in SDP models, particularly the features at the top 2 and top 3 ranks; iv) class overlap handling techniques can statistically significantly improve the performance of SDP models trained on datasets with overlap ratios above 12.5%. We suggest that future work apply our KNN method to identify the overlap ratio of a dataset before building SDP models.
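The abstract describes identifying overlapping instances from the class distribution in each instance's local neighborhood, but it does not give the exact decision rule. The following is only a minimal sketch of that general idea, not the authors' method. The function name `overlap_ratio`, the parameters `k` and `threshold`, the use of Euclidean distance, and the rule "flag an instance when at least `threshold` of its k nearest neighbours carry the other label" are all assumptions made for illustration.

```python
import numpy as np


def overlap_ratio(X, y, k=5, threshold=0.5):
    """Estimate the share of overlapping instances in a labelled dataset.

    Assumed rule (not taken from the paper): an instance is flagged as
    overlapping when at least `threshold` of its k nearest neighbours
    (Euclidean distance, excluding itself) have a different class label.
    Returns (ratio, mask), where mask marks the flagged instances.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(X)

    # Pairwise Euclidean distances; O(n^2) memory, fine for small datasets.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbour

    k = min(k, n - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Fraction of each instance's neighbours that belong to the other class.
    frac_other = (y[neighbours] != y[:, None]).mean(axis=1)
    mask = frac_other >= threshold
    return mask.mean(), mask


# Toy data: two separated clusters plus one class-1 point inside the class-0 cluster.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11],
     [0.5, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
ratio, mask = overlap_ratio(X, y, k=3)
print(f"overlap ratio: {ratio:.3f}")
```

Under these assumptions, the resulting ratio could be compared with the 12.5% level reported in the abstract when deciding whether to apply an overlap handling technique before training an SDP model.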
Pages: 2440-2458
Page count: 19
Related papers
50 in total
  • [31] Influence Analysis Method of Class Imbalance on Software Defect Prediction Model Stability and Prediction Performance
    Zhang, Yan-Mei
    Zhi, Sheng-Lin
    Jiang, Shu-Juan
    Yuan, Guan
    [J]. Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2023, 51 (08) : 2076 - 2087
  • [32] Software defect prediction based on correlation weighted class association rule mining
    Shao, Yuanxun
    Liu, Bin
    Wang, Shihai
    Li, Guoqi
    [J]. KNOWLEDGE-BASED SYSTEMS, 2020, 196
  • [33] Cross-Project Software Defect Prediction Based on Class Code Similarity
    Wen, Wanzhi
    Shen, Chenqiang
    Lu, Xiaohong
    Li, Zhixian
    Wang, Haoren
    Zhang, Ruinian
    Zhu, Ningbo
    [J]. IEEE ACCESS, 2022, 10 : 105485 - 105495
  • [34] Class Imbalance Learning to Heterogeneous Cross-Software Projects Defect Prediction
    Vashisht, Rohit
    Rizvi, Syed Afzal Murtaza
    [J]. INTERNATIONAL JOURNAL OF SOFTWARE INNOVATION, 2022, 10 (01)
  • [35] Class Balancing Approaches in Dataset for Software Defect Prediction: A Systematic Literature Review
    Olvera-Villeda, Dan Javier
    Sanchez-Garcia, Angel J.
    Limon, Xavier
    Dominguez Isidro, Saul
    [J]. 2023 11TH INTERNATIONAL CONFERENCE IN SOFTWARE ENGINEERING RESEARCH AND INNOVATION, CONISOFT 2023, 2023 : 236 - 245
  • [36] A Hybrid Approach to Coping with High Dimensionality and Class Imbalance for Software Defect Prediction
    Gao, Kehan
    Khoshgoftaar, Taghi
    Napolitano, Amri
    [J]. 2012 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2012), VOL 2, 2012 : 281 - 288
  • [37] Adaptive Centre-Weighted Oversampling for Class Imbalance in Software Defect Prediction
    Zhao, Qi
    Yan, Xuefeng
    Zhou, Yong
    [J]. 2018 IEEE INT CONF ON PARALLEL & DISTRIBUTED PROCESSING WITH APPLICATIONS, UBIQUITOUS COMPUTING & COMMUNICATIONS, BIG DATA & CLOUD COMPUTING, SOCIAL COMPUTING & NETWORKING, SUSTAINABLE COMPUTING & COMMUNICATIONS, 2018 : 223 - 230
  • [38] Which type of metrics are useful to deal with class imbalance in software defect prediction?
    Ozturk, Muhammed Maruf
    [J]. INFORMATION AND SOFTWARE TECHNOLOGY, 2017, 92 : 17 - 29
  • [39] Defect prediction for embedded software
    Oral, Atac Deniz
    Bener, Ayse Basar
    [J]. 2007 22ND INTERNATIONAL SYMPOSIUM ON COMPUTER AND INFORMATION SCIENCES, 2007 : 346 - 351
  • [40] Research on software defect prediction
    Wang, Qing
    Wu, Shu-Jian
    Li, Ming-Shu
    [J]. Ruan Jian Xue Bao/Journal of Software, 2008, 19 (07) : 1565 - 1580