An empirical analysis of data preprocessing for machine learning-based software cost estimation

Cited by: 110
Authors
Huang, Jianglin [1]
Li, Yan-Fu [2]
Xie, Min [1]
Affiliations
[1] City Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Hong Kong, Peoples R China
[2] CentraleSupelec, Dept Ind Engn, Paris, France
Keywords
Software cost estimation; Data preprocessing; Missing-data treatments; Scaling; Feature selection; Case selection; SUPPORT VECTOR REGRESSION; MISSING DATA; MUTUAL INFORMATION; FEATURE-SELECTION; PREDICTION; MODELS; IMPUTATION; WEIGHTS; SIZE;
DOI
10.1016/j.infsof.2015.07.004
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Context: Due to the complex nature of the software development process, traditional parametric models and statistical methods often appear inadequate to model the increasingly complicated relationship between project development cost and project features (or cost drivers). Machine learning (ML) methods, with several reported successful applications, have gained popularity for software cost estimation in recent years. Many researchers describe data preprocessing as a fundamental stage of ML methods; however, very few works have focused on the effects of data preprocessing techniques. Objective: This study aims to empirically assess the effectiveness of data preprocessing techniques on ML methods in the context of software cost estimation. Method: We first conduct a literature survey of recent publications that use data preprocessing techniques, followed by a systematic empirical study that analyzes the strengths and weaknesses of individual data preprocessing techniques as well as their combinations. Results: Our results indicate that data preprocessing techniques may significantly influence the final prediction, and that they can sometimes have a negative impact on the prediction performance of ML methods. Conclusion: To reduce prediction errors and improve efficiency, data preprocessing techniques must be selected carefully according to the characteristics of the ML methods and of the datasets used for software cost estimation. (C) 2015 Elsevier B.V. All rights reserved.
Pages: 108-127
Number of pages: 20
Related papers
50 in total
  • [1] Review and Empirical Analysis of Machine Learning-Based Software Effort Estimation
    Rahman, Mizanur
    Sarwar, Hasan
    Kader, MD. Abdul
    Goncalves, Teresa
    Tin, Ting Tin
    IEEE ACCESS, 2024, 12 : 85661 - 85680
  • [2] Machine Learning-based Software Effort Estimation : An Analysis
    Polkowski, Zdzislaw
    Vora, Jayneel
    Tanwar, Sudeep
    Tyagi, Sudhanshu
    Singh, Pradeep Kumar
    Singh, Yashwant
    PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON ELECTRONICS, COMPUTERS AND ARTIFICIAL INTELLIGENCE (ECAI-2019), 2019
  • [3] A Machine Learning Based Model for Software Cost Estimation
    Tayyab, Muhammad Raza
    Usman, Muhammad
    Ahmad, Waseem
    PROCEEDINGS OF SAI INTELLIGENT SYSTEMS CONFERENCE (INTELLISYS) 2016, VOL 2, 2018, 16 : 402 - 414
  • [4] Machine Learning-based Identification of Contaminated Images in Light Curve Data Preprocessing
    Li, Hui
    Li, Rong-Wang
    Shu, Peng
    Li, Yu-Qiang
    RESEARCH IN ASTRONOMY AND ASTROPHYSICS, 2024, 24 (04)
  • [5] Machine Learning-based Identification of Contaminated Images in Light Curve Data Preprocessing
    Hui Li
    Rong-Wang Li
    Peng Shu
    Yu-Qiang Li
    Research in Astronomy and Astrophysics, 2024, 24 (04) : 289 - 297
  • [6] An empirical study of pattern leakage impact during data preprocessing on machine learning-based intrusion detection models reliability
    Bouke, Mohamed Aly
    Abdullah, Azizol
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 230
  • [7] An Empirical Analysis on Software Development Efforts Estimation in Machine Learning Perspective
    Rehman, Israr Ur
    Ali, Zulfiqar
    Jan, Zahoor
    ADCAIJ-ADVANCES IN DISTRIBUTED COMPUTING AND ARTIFICIAL INTELLIGENCE JOURNAL, 2021, 10 (03) : 227 - 240
  • [8] Performance Analysis on Machine Learning-Based Channel Estimation
    Mei, Kai
    Liu, Jun
    Zhang, Xiaochen
    Rajatheva, Nandana
    Wei, Jibo
    IEEE TRANSACTIONS ON COMMUNICATIONS, 2021, 69 (08) : 5183 - 5193
  • [9] Machine Learning Models for Software Cost Estimation
    Al Asheeri, Mahmood Mohd
    Hammad, Mustafa
    2019 INTERNATIONAL CONFERENCE ON INNOVATION AND INTELLIGENCE FOR INFORMATICS, COMPUTING, AND TECHNOLOGIES (3ICT), 2019
  • [10] An Empirical Analysis of Three-stage Data-Preprocessing for Analogy-based Software Effort Estimation on the ISBSG Data
    Huang, Jianglin
    Li, Yan-Fu
    Keung, Jacky Wai
    Yu, Y. T.
    Chan, W. K.
    2017 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY (QRS), 2017 : 442 - 449