Aggregating Data Sampling with Feature Subset Selection to Address Skewed Software Defect Data

Cited by: 4
Authors
Gao, Kehan [1]
Khoshgoftaar, Taghi M. [2]
Napolitano, Amri [2]
Affiliations
[1] Eastern Connecticut State Univ, Dept Math & Comp Sci, Willimantic, CT 06226 USA
[2] Florida Atlantic Univ, Dept Comp & Elect Engn & Comp Sci, Boca Raton, FL 33431 USA
Funding
U.S. National Science Foundation
Keywords
Feature subset selection; data sampling; high dimensionality; class imbalance; software defect prediction
DOI
10.1142/S0218194015400318
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Defect prediction is an important process activity frequently used to improve the quality and reliability of software products. Its results provide a list of fault-prone modules, which helps project managers make better use of valuable project resources. In the software quality modeling process, high dimensionality and class imbalance are two problems that may exist in data repositories. In this study, we investigate three data preprocessing approaches, in which feature selection is combined with data sampling, to overcome these problems in the context of software quality estimation. The three approaches are: Approach 1 - sampling performed prior to feature selection, but retaining the unsampled data instances; Approach 2 - sampling performed prior to feature selection, retaining the sampled data instances; and Approach 3 - sampling performed after feature selection. We present a comparative investigation evaluating the three approaches. In the experiments, we employed three sampling methods (random undersampling, random oversampling, and synthetic minority oversampling), each combined with a filter-based feature subset selection technique called correlation-based feature selection. We built the defect prediction models using five common classification algorithms. The case study was based on software metrics and defect data collected from multiple releases of a real-world software system. The results demonstrated that the type of sampling method used in data preprocessing significantly affected the performance of the combination approaches. When random undersampling was used, Approach 1 performed better than the other two approaches. However, when feature selection was used in conjunction with an oversampling method (random oversampling or synthetic minority oversampling), we strongly recommend Approach 3.
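The difference between the three approaches is only the order of the two preprocessing steps and which instances are kept for model building. The following is a minimal sketch of that ordering, not the authors' implementation. It rests on several assumptions that are not in the abstract: label 1 marks the fault-prone (minority) class, random undersampling balances the classes 1:1, and a simple ranking by absolute Pearson correlation with the class label stands in for correlation-based feature selection (it is not CFS). All function names are hypothetical.

```python
import random
from math import sqrt


def random_undersample(y, rng):
    """Row indices: every minority row (label 1) plus an equal-size
    random draw of majority rows (label 0). The 1:1 ratio is an assumption."""
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    return sorted(minority + rng.sample(majority, len(minority)))


def abs_correlation(column, y):
    """Absolute Pearson correlation between one feature column and the label."""
    n = len(y)
    mean_x, mean_y = sum(column) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(column, y))
    var = sum((a - mean_x) ** 2 for a in column) * sum((b - mean_y) ** 2 for b in y)
    return abs(cov) / sqrt(var) if var > 0 else 0.0


def select_features(X, y, k):
    """Indices of the k features most correlated with the label.
    A stand-in filter for illustration only, not CFS."""
    scores = [abs_correlation([row[j] for row in X], y) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def project(X, y, rows, cols):
    """Keep only the given rows and feature columns."""
    return [[X[i][j] for j in cols] for i in rows], [y[i] for i in rows]


def approach_1(X, y, k, rng):
    # Sample first, select features on the sample, keep the UNSAMPLED instances.
    rows = random_undersample(y, rng)
    X_s, y_s = project(X, y, rows, range(len(X[0])))
    cols = select_features(X_s, y_s, k)
    return project(X, y, range(len(y)), cols)


def approach_2(X, y, k, rng):
    # Sample first, select features on the sample, keep the SAMPLED instances.
    rows = random_undersample(y, rng)
    X_s, y_s = project(X, y, rows, range(len(X[0])))
    cols = select_features(X_s, y_s, k)
    return project(X, y, rows, cols)


def approach_3(X, y, k, rng):
    # Select features on the original data, then sample.
    cols = select_features(X, y, k)
    rows = random_undersample(y, rng)
    return project(X, y, rows, cols)
```

Each function returns the reduced training data that would then be passed to a classifier. Approach 1 returns all original instances with fewer features, while Approaches 2 and 3 return the balanced subset.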
Pages: 1531-1550
Number of pages: 20