Missing data imputation with fuzzy feature selection for diabetes dataset

Cited by: 25

Authors:

Dzulkalnine, Mohamad Faiz [1]

Sallehuddin, Roselina [1]

Affiliation:

[1] Univ Teknol Malaysia, Fac Comp, Skudai 81300, Johor, Malaysia

Source:

SN APPLIED SCIENCES | 2019 / Volume 1 / Issue 04

Keywords:

Missing data; Fuzzy feature selection; Imputation; Classification; SUPPORT VECTOR MACHINES; ALGORITHMS; MODEL;

DOI:

10.1007/s42452-019-0383-x

Chinese Library Classification (CLC):

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

Missing data remain a difficulty for data analysis in many research fields, especially in the medical field, where they affect the diagnosis and treatment that a patient should receive. In this research, fuzzy c-means (FCM) is used to impute the missing data. However, like most data imputation methods, FCM does not consider the presence of irrelevant features. Irrelevant features can increase the computational time of the imputation process and decrease the accuracy of the prediction. Feature selection techniques can alleviate this problem by selecting the most relevant features and reducing the dataset size. Fuzzy principal component analysis (FPCA) is used as the feature selection method in this study because, unlike classical PCA, it accounts for outliers, which are the main reason some features become irrelevant. Therefore, an improved hybrid imputation model, FPCA-Support vector machines-FCM (FPCA-SVM-FCM), is proposed and employed in this study. The proposed model is evaluated on one dataset, the Pima Indians Diabetes dataset. Experimental results showed that the proposed hybrid imputation model outperforms the existing methods, producing more accurate estimates in terms of accuracy, RMSE, and MAE. The proposed method was also validated using the Wilcoxon rank sum test and Theil's U test and obtained good results compared to SVM-FCM. Therefore, it can be used as an alternative tool for handling missing data in order to obtain a better quality dataset.
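
The FCM imputation step and the RMSE and MAE measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the FPCA and SVM stages entirely, and it assumes a partial-distance strategy (distances computed over observed features only), which the abstract does not specify. The function names `fcm_impute`, `rmse`, and `mae`, the toy data, and all parameter defaults are hypothetical choices made for illustration.

```python
import math
import random


def fcm_impute(rows, n_clusters=2, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fill None entries using fuzzy c-means with partial distances (sketch)."""
    n, d = len(rows), len(rows[0])
    rng = random.Random(seed)
    # Random initial memberships; each row is normalised to sum to 1.
    U = []
    for _ in range(n):
        w = [rng.random() + 1e-9 for _ in range(n_clusters)]
        s = sum(w)
        U.append([v / s for v in w])
    C = [[0.0] * d for _ in range(n_clusters)]
    for _ in range(n_iter):
        # Centroid update: membership-weighted mean over observed values only.
        for k in range(n_clusters):
            for j in range(d):
                num = den = 0.0
                for i in range(n):
                    if rows[i][j] is not None:
                        w = U[i][k] ** m
                        num += w * rows[i][j]
                        den += w
                C[k][j] = num / den if den > 0 else 0.0
        # Membership update from partial squared distances
        # (observed features only, rescaled to the full feature count).
        shift = 0.0
        for i in range(n):
            obs = [j for j in range(d) if rows[i][j] is not None]
            inv = []
            for k in range(n_clusters):
                d2 = sum((rows[i][j] - C[k][j]) ** 2 for j in obs)
                d2 = max(d2 * d / max(len(obs), 1), 1e-12)
                inv.append(d2 ** (-1.0 / (m - 1.0)))
            s = sum(inv)
            new = [v / s for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, U[i])))
            U[i] = new
        if shift < tol:
            break
    # Replace each missing entry with the membership-weighted centroid value.
    filled = []
    for i in range(n):
        w = [u ** m for u in U[i]]
        s = sum(w)
        filled.append([
            rows[i][j] if rows[i][j] is not None
            else sum(w[k] * C[k][j] for k in range(n_clusters)) / s
            for j in range(d)
        ])
    return filled


def rmse(pred, true):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def mae(pred, true):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


# Toy data (made up for illustration): two loose groups, two missing cells.
data = [[0.0, 0.1], [0.2, None], [5.0, 5.1], [None, 4.9], [0.1, 0.0], [5.2, 5.0]]
filled = fcm_impute(data, n_clusters=2)
print("imputed values:", filled[1][1], filled[3][0])
```

In an evaluation like the one the abstract describes, known values would be masked, imputed, and then compared against the held-out originals with `rmse` and `mae`; lower values indicate closer estimates.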


Pages: 12
