Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model

Cited by: 0
Authors
Samuel Njoroge Kigo
Evans Otieno Omondi
Bernard Oguna Omolo
Affiliations
[1] Strathmore University, Institute of Mathematical Sciences
[2] African Population and Health Research Center, Division of Mathematics and Computer Science
[3] University of South Carolina–Upstate, Faculty of Health Sciences, School of Public Health
[4] University of the Witwatersrand
Source
Keywords
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
This study conducted a comprehensive analysis of multiple supervised machine learning models, both regressors and classifiers, to predict diamond prices accurately. Diamond pricing is a complex task due to the non-linear relationships between key features such as carat, cut, clarity, table, and depth. The analysis aimed to develop an accurate predictive model by utilizing both regression and classification approaches. The data were preprocessed in several steps. Outliers were handled using the inter-quartile range method, numerical features were normalized through standardization, and missing values in numerical features were imputed using the median, preserving the integrity of the dataset. Multicollinearity was resolved with correlation-based feature selection, which eliminated highly correlated variables so that only relevant features were included in the models. Equal-width binning on the cut variable was performed to handle class imbalance. Among the models evaluated, the random forest (RF) regressor performed best. It achieved the lowest root mean squared error (RMSE) of 523.50, indicating superior accuracy compared to the other models. The RF regressor also obtained a high R-squared ($R^2$) score of 0.985, suggesting it explained a large portion of the variance in diamond prices.
Furthermore, the area under the curve with the RF classifier for the test set was 1.00 (100%), indicating perfect classification performance. These results solidify the RF's position as the best-performing model in terms of accuracy and predictive power, both in regression and classification. The MLP regressor showed promising results with an RMSE of 563.74 and an $R^2$ score of 0.980, demonstrating its ability to capture the complex relationships in the data. Although it achieved slightly higher errors than the RF regressor, further analysis is needed to determine its suitability and potential advantages compared to the RF regressor. The XGBoost Regressor achieved an RMSE of 612.88 and an $R^2$ score of 0.972, indicating its effectiveness in predicting diamond prices but with slightly higher errors compared to the RF regressor.
The Boosted Decision Tree Regressor had an RMSE of 711.31 and an $R^2$ score of 0.968, demonstrating its ability to capture some of the underlying patterns but with higher errors than the RF and XGBoost models. In contrast, the KNN regressor yielded a higher RMSE of 1346.65 and a lower $R^2$ score of 0.887, indicating its inferior performance in accurately predicting diamond prices compared to the models above. The Linear Regression model performed similarly to the KNN regressor, with an RMSE of 1395.41 and an $R^2$ score of 0.876. The Support Vector Regression model showed the highest RMSE of 3044.49 and the lowest $R^2$ score of 0.421, indicating its limited effectiveness in capturing the complex relationships in the data.
Overall, the study demonstrates that the RF outperforms the other models in terms of accuracy and predictive power, as evidenced by its lowest RMSE, highest $R^2$ score, and perfect classification performance. This highlights its suitability for accurately predicting diamond prices. The study not only provides an effective tool for the diamond industry but also emphasizes the importance of considering both regression and classification approaches in developing accurate predictive models. The findings contribute valuable insights for pricing strategies, market trends, and decision-making processes in the diamond industry and related fields.
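The preprocessing steps and the two regression metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the 1.5 × IQR fence multiplier is a common convention assumed here (the abstract only names the inter-quartile range method), and standardization uses the population standard deviation as one possible choice.

```python
import math
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles.

    k = 1.5 is an assumed convention, not a value taken from the study.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A typical order would be to impute first, then filter rows against the fences returned by `iqr_bounds`, then standardize, and finally score a fitted model's predictions with `rmse` and `r_squared`. A lower RMSE and an $R^2$ closer to 1 correspond to the ranking reported in the abstract.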
Related papers
50 items in total
  • [1] Assessing predictive performance of supervised machine learning algorithms for a diamond pricing model
    Kigo, Samuel Njoroge
    Omondi, Evans Otieno
    Omolo, Bernard Oguna
    [J]. SCIENTIFIC REPORTS, 2023, 13 (01)
  • [2] Comparative Analysis of Supervised Machine Learning Algorithms to Build a Predictive Model for Evaluating Students' Performance
    El Guabassi, Inssaf
    Bousalem, Zakaria
    Marah, Rim
    Qazdar, Aimad
    [J]. INTERNATIONAL JOURNAL OF ONLINE AND BIOMEDICAL ENGINEERING, 2021, 17 (02) : 90 - 105
  • [3] Supervised Rainfall Learning Model Using Machine Learning Algorithms
    Sharma, Amit Kumar
    Chaurasia, Sandeep
    Srivastava, Devesh Kumar
    [J]. INTERNATIONAL CONFERENCE ON ADVANCED MACHINE LEARNING TECHNOLOGIES AND APPLICATIONS (AMLTA2018), 2018, 723 : 275 - 283
  • [4] Predictive Analytics of Sensor Data Based on Supervised Machine Learning Algorithms
    Gupta, Shreya
    Mittal, Mohit
    Padha, Anupama
    [J]. 2017 INTERNATIONAL CONFERENCE ON NEXT GENERATION COMPUTING AND INFORMATION SYSTEMS (ICNGCIS), 2017, : 171 - 176
  • [5] Performance Evaluation of Supervised Machine Learning Algorithms for Intrusion Detection
    Belavagi, Manjula C.
    Muniyal, Balachandra
    [J]. TWELFTH INTERNATIONAL CONFERENCE ON COMMUNICATION NETWORKS, ICCN 2016 / TWELFTH INTERNATIONAL CONFERENCE ON DATA MINING AND WAREHOUSING, ICDMW 2016 / TWELFTH INTERNATIONAL CONFERENCE ON IMAGE AND SIGNAL PROCESSING, ICISP 2016, 2016, 89 : 117 - 123
  • [6] Performance Analysis of Supervised Machine Learning Algorithms for Text Classification
    Mishu, Sadia Zaman
    Rafiuddin, S. M.
    [J]. PROCEEDINGS OF THE 2016 19TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY (ICCIT), 2016, : 409 - 413
  • [7] High Accuracy Predictive Model on Breast Cancer Using Ensemble Approach of Supervised Machine Learning Algorithms
    Kaul, Chaitanya
    Sharma, Neeraj
    [J]. 2021 INTERNATIONAL CONFERENCE ON COMPUTATIONAL PERFORMANCE EVALUATION (COMPE-2021), 2021, : 71 - +
  • [8] Prediction Model for Bollywood Movie Success: A Comparative Analysis of Performance of Supervised Machine Learning Algorithms
    Verma, Hemraj
    Verma, Garima
    [J]. REVIEW OF SOCIONETWORK STRATEGIES, 2020, 14 (01): : 1 - 17
  • [9] Prediction Model for Bollywood Movie Success: A Comparative Analysis of Performance of Supervised Machine Learning Algorithms
    Hemraj Verma
    Garima Verma
    [J]. The Review of Socionetwork Strategies, 2020, 14 : 1 - 17
  • [10] A Review of Supervised Machine Learning Algorithms
    Singh, Amanpreet
    Thakur, Narina
    Sharma, Aakanksha
    [J]. PROCEEDINGS OF THE 10TH INDIACOM - 2016 3RD INTERNATIONAL CONFERENCE ON COMPUTING FOR SUSTAINABLE GLOBAL DEVELOPMENT, 2016, : 1310 - 1315