Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud

Cited by: 15
Authors
Bu, Fanyu [1, 2]
Chen, Zhikui [1]
Zhang, Qingchen [1]
Yang, Laurence T. [3]
Affiliations
[1] Dalian Univ Technol, Sch Software Technol, Dalian 116620, Peoples R China
[2] Inner Mongolia Univ Finance & Econ, Coll Vocat, Hohhot 010010, Peoples R China
[3] St Francis Xavier Univ, Dept Comp Sci, Antigonish, NS B2G 2W5, Canada
Source
JOURNAL OF SUPERCOMPUTING | 2016 / Vol. 72 / Issue 08
Keywords
High-dimensional data; Incomplete data imputation; Feature subset selection; Clustering analysis; SUPPORT VECTOR REGRESSION; C-MEANS; VALUES
DOI
10.1007/s11227-015-1433-9
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Incomplete data imputation plays an important role in big data analysis and smart computing. Existing algorithms are inefficient and inaccurate when imputing incomplete high-dimensional data. The paper proposes an incomplete high-dimensional data imputation algorithm based on feature selection and cluster analysis (IHDIFC), which works in three steps. First, a hierarchical clustering-based feature subset selection algorithm is designed to reduce the dimensionality of the data set. Second, a parallel k-means algorithm based on partial distance is derived to cluster the selected data subset efficiently. Finally, the data objects in the same cluster as the target object are used to estimate its missing feature values. Extensive experiments compare IHDIFC with two representative missing data imputation algorithms, namely FIMUS and DMI. The results demonstrate that the proposed algorithm achieves better imputation accuracy and takes significantly less time than the other two algorithms when imputing high-dimensional data.
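The second and third steps of the abstract can be illustrated with a minimal, serial sketch. It is not the authors' IHDIFC implementation: the feature selection step and the parallel execution are omitted, the partial distance is assumed to be the common "partial distance strategy" (squared distance over observed features, rescaled by total/observed feature count), and the cluster-mean fill rule and all function names (`partial_distance`, `kmeans_partial`, `impute_by_cluster`) are hypothetical choices made for illustration.

```python
import numpy as np

def partial_distance(x, c):
    """Squared distance over the observed (non-NaN) features of x,
    rescaled by (total features / observed features). Assumed definition."""
    mask = ~np.isnan(x)
    n_obs = mask.sum()
    if n_obs == 0:
        return np.inf  # nothing observed: no usable distance
    diff = x[mask] - c[mask]
    return (x.size / n_obs) * float(np.dot(diff, diff))

def kmeans_partial(X, k, n_iter=20, seed=0):
    """Plain (serial) k-means that tolerates NaNs via partial distances."""
    rng = np.random.default_rng(seed)
    col_means = np.nanmean(X, axis=0)
    idx = rng.choice(len(X), size=k, replace=False)
    # Initial centers: sampled rows, with missing entries filled by column means.
    centers = np.where(np.isnan(X[idx]), col_means, X[idx])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = np.array(
            [np.argmin([partial_distance(x, c) for c in centers]) for x in X]
        )
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the old center for an empty cluster
            obs = ~np.isnan(members)
            counts = obs.sum(axis=0)
            sums = np.where(obs, members, 0.0).sum(axis=0)
            # Per-feature mean of observed values; keep the old value if none observed.
            centers[j] = np.where(counts > 0, sums / np.maximum(counts, 1), centers[j])
    return labels, centers

def impute_by_cluster(X, labels, centers):
    """Fill each missing entry with the matching feature of the row's cluster center."""
    filled = X.copy()
    missing = np.isnan(filled)
    filled[missing] = centers[labels][missing]
    return filled

X = np.array([
    [0.0, 0.0], [0.1, np.nan], [0.0, 0.2],
    [10.0, 10.0], [10.2, np.nan], [9.8, 10.2],
])
labels, centers = kmeans_partial(X, k=2)
X_filled = impute_by_cluster(X, labels, centers)
print(X_filled)
```

Filling from the cluster center is the simplest way to use "objects in the same cluster"; a weighted or nearest-neighbour estimate within the cluster would fit the abstract's description equally well.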
Pages: 2977 - 2990
Number of pages: 14