Early Prediction of Diabetes Using an Ensemble of Machine Learning Models

Cited by: 17
Authors
Dutta, Aishwariya [1, 2]
Hasan, Md Kamrul [3]
Ahmad, Mohiuddin [3]
Awal, Md Abdul [4, 5]
Islam, Md Akhtarul [6]
Masud, Mehedi [7]
Meshref, Hossam [7]
Affiliations
[1] Khulna Univ Engn & Technol KUET, Dept Biomed Engn BME, Khulna 9203, Bangladesh
[2] Mil Inst Sci & Technol MIST, Dept Biomed Engn BME, Dhaka 1216, Bangladesh
[3] Khulna Univ Engn & Technol KUET, Dept Elect & Elect Engn EEE, Khulna 9203, Bangladesh
[4] Univ Queensland, Sch Informat Technol & Elect Engn, Brisbane, Qld 4072, Australia
[5] Khulna Univ KU, Elect & Commun Engn ECE Discipline, Khulna 9208, Bangladesh
[6] Khulna Univ KU, Stat Discipline, Khulna 9208, Bangladesh
[7] Taif Univ, Coll Comp & Informat Technol, Dept Comp Sci, POB 11099, Taif 21944, Saudi Arabia
Keywords
artificial intelligence; diabetes prediction; ensemble ML classifier; filling missing value; outlier rejection; South Asian diabetes dataset; systematic analysis; classification; prevalence
DOI
10.3390/ijerph191912378
Chinese Library Classification (CLC) number
X [Environmental Science, Safety Science]
Discipline classification code
08; 0830
Abstract
Diabetes is one of the most rapidly spreading diseases in the world and leads to significant complications, including cardiovascular disease, kidney failure, diabetic retinopathy, and neuropathy, which increase morbidity and mortality. If diabetes is diagnosed at an early stage, its severity and underlying risk factors can be significantly reduced. However, reliable labeled clinical datasets for diabetes prediction are scarce, and those that exist often contain outliers and missing values, which makes early prediction challenging. Therefore, we introduce a newly labeled diabetes dataset from a South Asian nation (Bangladesh). In addition, we propose an automated classification pipeline that includes a weighted ensemble of machine learning (ML) classifiers: Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), XGBoost (XGB), and LightGBM (LGB). Grid search is used to tune the critical hyperparameters of these ML models. The framework also includes missing value imputation, feature selection, and K-fold cross-validation. An analysis of variance (ANOVA) test shows that diabetes prediction performance improves significantly when the proposed weighted ensemble (DT + RF + XGB + LGB) is run with the introduced preprocessing, reaching the highest accuracy of 0.735 and an area under the ROC curve (AUC) of 0.832. Combined with the proposed ensemble model, our statistical imputation and RF-based feature selection techniques produced the best results for early diabetes prediction. Moreover, the new dataset will support the development and deployment of robust ML models for diabetes prediction using population-level data.
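The abstract does not say how the weighted ensemble combines its members, so the following is only a minimal sketch of one common approach, weighted soft voting, in which each model's predicted probability is averaged using per-model weights. The function name `weighted_soft_vote`, the probabilities, the weights, and the 0.5 decision threshold are all illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence


def weighted_soft_vote(
    probas: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Return one weighted-average positive-class probability per sample.

    probas[m][i] is model m's predicted probability for sample i.
    weights[m] is a non-negative weight for model m (hypothetical values).
    """
    if len(probas) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(probas[0])
    if any(len(p) != n_samples for p in probas):
        raise ValueError("all models must score the same number of samples")
    # Normalise by the weight total so the result stays in [0, 1].
    return [
        sum(w * p[i] for w, p in zip(weights, probas)) / total
        for i in range(n_samples)
    ]


# Made-up probabilities for three samples from four models (assumption).
model_probas = [
    [0.0, 1.0, 1.0],  # DT
    [0.2, 0.8, 0.6],  # RF
    [0.1, 0.9, 0.4],  # XGB
    [0.3, 0.7, 0.5],  # LGB
]
# Made-up weights; in practice these might come from validation scores.
model_weights = [1.0, 2.0, 3.0, 2.0]

scores = weighted_soft_vote(model_probas, model_weights)
labels = [int(s >= 0.5) for s in scores]  # 0.5 threshold is an assumption
```

In a setup like this, a model with a larger weight pulls the ensemble score toward its own prediction, so a weak member cannot outvote stronger ones on its own.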
Pages: 25
Related papers
  • [1] A practical framework for early detection of diabetes using ensemble machine learning models
    Saihood, Qusay
    Sonuc, Emrullah
    [J]. TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2023, 31 (04) : 722 - 738
  • [2] Prediction of diabetes disease using an ensemble of machine learning multi-classifier models
    Abnoosian, Karlo
    Farnoosh, Rahman
    Behzadi, Mohammad Hassan
    [J]. BMC BIOINFORMATICS, 2023, 24 (01)
  • [3] Prediction of diabetes disease using an ensemble of machine learning multi-classifier models
    Karlo Abnoosian
    Rahman Farnoosh
    Mohammad Hassan Behzadi
    [J]. BMC Bioinformatics, 24
  • [4] Integrating ensemble and machine learning models for early prediction of pneumonia mortality using laboratory tests
    Baik, Seung Min
    Hong, Kyung Sook
    Lee, Jae-Myeong
    Park, Dong Jin
    [J]. HELIYON, 2024, 10 (14)
  • [5] The early prediction of gestational diabetes mellitus by machine learning models
    Kaya, Yeliz
    Butun, Zafer
    Celik, Ozer
    Salik, Ece Akca
    Tahta, Tugba
    Yavuz, Arzu Altun
    [J]. BMC PREGNANCY AND CHILDBIRTH, 2024, 24 (01)
  • [6] Susceptibility Prediction of Groundwater Hardness Using Ensemble Machine Learning Models
    Mosavi, Amirhosein
    Hosseini, Farzaneh Sajedi
    Choubin, Bahram
    Abdolshahnejad, Mahsa
    Gharechaee, Hamidreza
    Lahijanzadeh, Ahmadreza
    Dineva, Adrienn A.
    [J]. WATER, 2020, 12 (10)
  • [7] Early Stage DRC Prediction Using Ensemble Machine Learning Algorithms
    Islam, Riadul
    [J]. IEEE CANADIAN JOURNAL OF ELECTRICAL AND COMPUTER ENGINEERING, 2022, 45 (04) : 354 - 364
  • [8] Early Mortality Risk Prediction in Covid-19 Patients Using an Ensemble of Machine Learning Models
    Walia, Harsh
    Jeevaraj, S.
    [J]. 2021 INTERNATIONAL CONFERENCE ON COMPUTATIONAL PERFORMANCE EVALUATION (COMPE-2021), 2021 : 965 - 970
  • [9] A stacked ensemble machine learning approach for the prediction of diabetes
    Oliullah, Khondokar
    Rasel, Mahedi Hasan
    Islam, Md. Manzurul
    Islam, Md. Reazul
    Wadud, Md. Anwar Hussen
    Whaiduzzaman, Md.
    [J]. JOURNAL OF DIABETES AND METABOLIC DISORDERS, 2023, 23 (1) : 603 - 617
  • [10] Prediction of Diabetes at Early Stage using Interpretable Machine Learning
    Islam, Mohammad Sajidul
    Alam, Md Minul
    Ahamed, Afsana
    Meerza, Syed Imran Ali
    [J]. SOUTHEASTCON 2023, 2023 : 261 - 265