An empirical analysis of data preprocessing for machine learning-based software cost estimation

Cited by: 110
Authors
Huang, Jianglin [1]
Li, Yan-Fu [2]
Xie, Min [1]
Affiliations
[1] City Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Hong Kong, Peoples R China
[2] CentraleSupelec, Dept Ind Engn, Paris, France
Keywords
Software cost estimation; Data preprocessing; Missing-data treatments; Scaling; Feature selection; Case selection; SUPPORT VECTOR REGRESSION; MISSING DATA; MUTUAL INFORMATION; FEATURE-SELECTION; PREDICTION; MODELS; IMPUTATION; WEIGHTS; SIZE;
DOI
10.1016/j.infsof.2015.07.004
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Context: Due to the complex nature of software development process, traditional parametric models and statistical methods often appear to be inadequate to model the increasingly complicated relationship between project development cost and the project features (or cost drivers). Machine learning (ML) methods, with several reported successful applications, have gained popularity for software cost estimation in recent years. Data preprocessing has been claimed by many researchers as a fundamental stage of ML methods; however, very few works have been focused on the effects of data preprocessing techniques.
Objective: This study aims for an empirical assessment of the effectiveness of data preprocessing techniques on ML methods in the context of software cost estimation.
Method: In this work, we first conduct a literature survey of the recent publications using data preprocessing techniques, followed by a systematic empirical study to analyze the strengths and weaknesses of individual data preprocessing techniques as well as their combinations.
Results: Our results indicate that data preprocessing techniques may significantly influence the final prediction. They sometimes might have negative impacts on prediction performance of ML methods.
Conclusion: In order to reduce prediction errors and improve efficiency, a careful selection is necessary according to the characteristics of machine learning methods, as well as the datasets used for software cost estimation.
(C) 2015 Elsevier B.V. All rights reserved.
Pages: 108 - 127
Number of pages: 20
Related papers
50 in total
  • [31] Machine learning-based extrapolation of crop cultivation cost
    Bari, Poonam
    Ragha, Lata
    INTELIGENCIA ARTIFICIAL-IBEROAMERICAN JOURNAL OF ARTIFICIAL INTELLIGENCE, 2024, 27 (74): : 80 - 101
  • [32] Software Defect Prediction using Propositionalization based Data Preprocessing: An Empirical Study
    Pak, CholMyong
    Wang, Tian Tian
    Su, Xiao Hong
    2ND INTERNATIONAL CONFERENCE ON DATA SCIENCE AND BUSINESS ANALYTICS (ICDSBA 2018), 2018, : 71 - 77
  • [33] Learning Systems: Machine-Learning in Software Products and Learning-Based Analysis of Software Systems Special Track at ISoLA 2016
    Howar, Falk
    Meinke, Karl
    Rausch, Andreas
    LEVERAGING APPLICATIONS OF FORMAL METHODS, VERIFICATION AND VALIDATION: DISCUSSION, DISSEMINATION, APPLICATIONS, ISOLA 2016, PT II, 2016, 9953 : 651 - 654
  • [34] Machine Learning-Based Smart Home Data Analysis and Forecasting Method
    Park, Sanguk
    2023 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS, ICCE, 2023,
  • [35] Machine Learning-Based Field Data Analysis and Modeling for Drone Communications
    Shan, Lin
    Miura, Ryu
    Kagawa, Toshinori
    Ono, Fumie
    Li, Huan-Bang
    Kojima, Fumihide
    IEEE ACCESS, 2019, 7 : 79127 - 79135
  • [36] Support vector machine learning-based fMRI data group analysis
    Wang, Ze
    Childress, Anna R.
    Wang, Jiongjiong
    Detre, John A.
    NEUROIMAGE, 2007, 36 (04) : 1139 - 1151
  • [37] Data Representation in Machine Learning-Based Sentiment Analysis of Customer Reviews
    Shamshurin, Ivan
    PATTERN RECOGNITION AND MACHINE INTELLIGENCE, 2011, 6744 : 254 - 260
  • [38] Source Number Estimation via Machine Learning Based on Eigenvalue Preprocessing
    Zhou, Shuai
    Li, Tao
    Li, Yongzhao
    Zhang, Rui
    Ruan, Yuhan
    IEEE COMMUNICATIONS LETTERS, 2022, 26 (10) : 2360 - 2364
  • [39] Machine learning-based software sensors for machine state monitoring-The role of SMOTE-based data augmentation
    Kummer, Alex
    Ruppert, Tamas
    Medvegy, Tibor
    Abonyi, Janos
    RESULTS IN ENGINEERING, 2022, 16
  • [40] Review of machine learning-based Mineral Resource estimation
    Mahboob, M. A.
    Celik, T.
    Genc, B.
    JOURNAL OF THE SOUTHERN AFRICAN INSTITUTE OF MINING AND METALLURGY, 2022, 122 (11) : 655 - 664