An empirical analysis of data preprocessing for machine learning-based software cost estimation

Cited by: 110
Authors
Huang, Jianglin [1]
Li, Yan-Fu [2]
Xie, Min [1]
Affiliations
[1] City Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Hong Kong, Peoples R China
[2] CentraleSupelec, Dept Ind Engn, Paris, France
Keywords
Software cost estimation; Data preprocessing; Missing-data treatments; Scaling; Feature selection; Case selection; SUPPORT VECTOR REGRESSION; MISSING DATA; MUTUAL INFORMATION; FEATURE-SELECTION; PREDICTION; MODELS; IMPUTATION; WEIGHTS; SIZE;
DOI
10.1016/j.infsof.2015.07.004
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Context: Due to the complex nature of software development process, traditional parametric models and statistical methods often appear to be inadequate to model the increasingly complicated relationship between project development cost and the project features (or cost drivers). Machine learning (ML) methods, with several reported successful applications, have gained popularity for software cost estimation in recent years. Data preprocessing has been claimed by many researchers as a fundamental stage of ML methods; however, very few works have been focused on the effects of data preprocessing techniques.
Objective: This study aims for an empirical assessment of the effectiveness of data preprocessing techniques on ML methods in the context of software cost estimation.
Method: In this work, we first conduct a literature survey of the recent publications using data preprocessing techniques, followed by a systematic empirical study to analyze the strengths and weaknesses of individual data preprocessing techniques as well as their combinations.
Results: Our results indicate that data preprocessing techniques may significantly influence the final prediction. They sometimes might have negative impacts on prediction performance of ML methods.
Conclusion: In order to reduce prediction errors and improve efficiency, a careful selection is necessary according to the characteristics of machine learning methods, as well as the datasets used for software cost estimation.
(C) 2015 Elsevier B.V. All rights reserved.
Pages: 108 - 127
Number of pages: 20
Related papers
50 in total
  • [41] Confidence Estimation for Machine Learning-Based Quantitative Photoacoustics
    Groehl, Janek
    Kirchner, Thomas
    Adler, Tim
    Maier-Hein, Lena
    JOURNAL OF IMAGING, 2018, 4 (12)
  • [42] Machine learning-based bladder effusion estimation model construction on intravesical pressure data
    Yuan, Gang
    Li, Yu
    Ge, Zicong
    Yang, Xiaodong
    Zheng, Jian
    Wu, Zhongyi
    Zhang, Yin
    Zhang, Wanlu
    Tang, Liangfeng
    BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2023, 86
  • [43] Machine learning-based area estimation using data measured under walking conditions
    Nakayama, Shota
    Aikawa, Satoru
    Yamamoto, Shinichiro
    IEICE COMMUNICATIONS EXPRESS, 2024, 13 (06): 172 - 175
  • [44] EMPIRICAL COMPARISON AND ANALYSIS OF MACHINE LEARNING-BASED PREDICTORS FOR PREDICTING AND ANALYZING OF THERMOPHILIC PROTEINS
    Charoenkwan, Phasit
    Schaduangrat, Nalini
    Hasan, Md Mehedi
    Moni, Mohammad Ali
    Lio, Pietro
    Shoombuatong, Watshara
    EXCLI JOURNAL, 2022, 21 : 554 - 570
  • [45] Data preprocessing for machine-learning-based adaptive data center transmission
    Keykhosravi, Kamran
    Hamednia, Ahad
    Rastegarfar, Houman
    Agrell, Erik
    ICT EXPRESS, 2022, 8 (01): 37 - 43
  • [46] Software Design Decisions for Greener Machine Learning-based Systems
    del Rey, Santiago
    PROCEEDINGS 2024 IEEE/ACM 3RD INTERNATIONAL CONFERENCE ON AI ENGINEERING-SOFTWARE ENGINEERING FOR AI, CAIN 2024, 2024: 256 - 258
  • [47] Machine Learning-Based Multipath Routing for Software Defined Networks
    Mohamad Khattar Awad
    Marwa Hassan Hafez Ahmed
    Ali F. Almutairi
    Imtiaz Ahmad
    Journal of Network and Systems Management, 2021, 29
  • [48] Machine Learning-Based Multipath Routing for Software Defined Networks
    Awad, Mohamad Khattar
    Ahmed, Marwa Hassan Hafez
    Almutairi, Ali F.
    Ahmad, Imtiaz
    JOURNAL OF NETWORK AND SYSTEMS MANAGEMENT, 2021, 29 (02)
  • [49] Prediction of software quality with Machine Learning-Based ensemble methods
    Ceran A.A.
    Ar Y.
    Tanrıöver Ö.Ö.
    Seyrek Ceran S.
    Materials Today: Proceedings, 2023, 81 : 18 - 25
  • [50] Machine Learning-Based System for Detecting Unseen Malicious Software
    Bisio, Federica
    Gastaldo, Paolo
    Meda, Claudia
    Nasta, Stefano
    Zunino, Rodolfo
    APPLICATIONS IN ELECTRONICS PERVADING INDUSTRY, ENVIRONMENT AND SOCIETY, APPLEPIES 2014, 2016, 351 : 9 - 15