Early Prediction of Diabetes Using an Ensemble of Machine Learning Models

Cited by: 17
Authors
Dutta, Aishwariya [1, 2]
Hasan, Md Kamrul [3]
Ahmad, Mohiuddin [3]
Awal, Md Abdul [4, 5]
Islam, Md Akhtarul [6]
Masud, Mehedi [7]
Meshref, Hossam [7]
Affiliations
[1] Khulna Univ Engn & Technol KUET, Dept Biomed Engn BME, Khulna 9203, Bangladesh
[2] Mil Inst Sci & Technol MIST, Dept Biomed Engn BME, Dhaka 1216, Bangladesh
[3] Khulna Univ Engn & Technol KUET, Dept Elect & Elect Engn EEE, Khulna 9203, Bangladesh
[4] Univ Queensland, Sch Informat Technol & Elect Engn, Brisbane, Qld 4072, Australia
[5] Khulna Univ KU, Elect & Commun Engn ECE Discipline, Khulna 9208, Bangladesh
[6] Khulna Univ KU, Stat Discipline, Khulna 9208, Bangladesh
[7] Taif Univ, Coll Comp & Informat Technol, Dept Comp Sci, POB 11099, Taif 21944, Saudi Arabia
Keywords
artificial intelligence; diabetes prediction; ensemble ML classifier; filling missing value; outlier rejection; South Asian diabetes dataset; systematic analysis; classification; prevalence
DOI
10.3390/ijerph191912378
Chinese Library Classification number
X [Environmental Science, Safety Science]
Discipline classification codes
08; 0830
Abstract
Diabetes is one of the most rapidly spreading diseases in the world. It leads to significant complications, including cardiovascular disease, kidney failure, diabetic retinopathy, and neuropathy, which increase morbidity and mortality rates. If diabetes is diagnosed at an early stage, its severity and underlying risk factors can be significantly reduced. However, reliable and effective clinical datasets for diabetes prediction are scarce: labeled data are in short supply, and the available data often contain outliers or missing values, which makes early prediction challenging. Therefore, we introduce a newly labeled diabetes dataset from a South Asian nation (Bangladesh). In addition, we propose an automated classification pipeline that includes a weighted ensemble of machine learning (ML) classifiers: Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), XGBoost (XGB), and LightGBM (LGB). Grid search is employed to tune the critical hyperparameters of these ML models. The framework also includes missing value imputation, feature selection, and K-fold cross-validation. An analysis of variance (ANOVA) test reveals that diabetes prediction performance improves significantly when the proposed weighted ensemble (DT + RF + XGB + LGB) is run with the introduced preprocessing, reaching the highest accuracy of 0.735 and an area under the ROC curve (AUC) of 0.832. Combined with the proposed ensemble model, our statistical imputation and RF-based feature selection techniques produced the best results for early diabetes prediction. Moreover, the new dataset will contribute to developing and implementing robust ML models for diabetes prediction using population-level data.
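The abstract does not spell out how the weighted ensemble combines its members. As a minimal sketch, assuming weighted soft voting (each classifier's predicted probability of diabetes is averaged with a per-model weight), the idea could look like the following. The function names, the probabilities, and the weights are all hypothetical and are not taken from the paper; in practice the weights might come from a per-model validation score.

```python
def weighted_soft_vote(probabilities, weights):
    """Weighted average of per-model positive-class probabilities.

    probabilities: dict of model name -> list of P(diabetic), one per sample
    weights: dict of model name -> non-negative weight (assumed here,
             e.g. a validation score for each model)
    """
    total = sum(weights[name] for name in probabilities)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    lengths = {len(probs) for probs in probabilities.values()}
    if len(lengths) != 1:
        raise ValueError("all models must score the same number of samples")
    n_samples = lengths.pop()
    combined = []
    for i in range(n_samples):
        score = sum(weights[name] * probs[i]
                    for name, probs in probabilities.items())
        combined.append(score / total)
    return combined


def predict_labels(combined, threshold=0.5):
    """Turn ensemble probabilities into 0/1 labels at a fixed threshold."""
    return [1 if p >= threshold else 0 for p in combined]


# Illustrative, made-up probabilities for three patients from four models.
model_probs = {
    "DT": [0.9, 0.2, 0.6],
    "RF": [0.8, 0.3, 0.4],
    "XGB": [0.7, 0.1, 0.5],
    "LGB": [0.6, 0.2, 0.3],
}
# Illustrative, made-up weights; larger means more trusted.
model_weights = {"DT": 1.0, "RF": 2.0, "XGB": 3.0, "LGB": 4.0}

ensemble_probs = weighted_soft_vote(model_probs, model_weights)
ensemble_labels = predict_labels(ensemble_probs)
```

With these made-up inputs the third patient is scored below the 0.5 threshold even though DT alone would flag them, because the more heavily weighted models disagree. That is the intended effect of weighting: members judged more reliable have more say in the final decision.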
Pages: 25