Incomplete high-dimensional data imputation algorithm using feature selection and clustering analysis on cloud

Cited by: 15
Authors
Bu, Fanyu [1 ,2 ]
Chen, Zhikui [1 ]
Zhang, Qingchen [1 ]
Yang, Laurence T. [3 ]
Affiliations
[1] Dalian Univ Technol, Sch Software Technol, Dalian 116620, Peoples R China
[2] Inner Mongolia Univ Finance & Econ, Coll Vocat, Hohhot 010010, Peoples R China
[3] St Francis Xavier Univ, Dept Comp Sci, Antigonish, NS B2G 2W5, Canada
Source
JOURNAL OF SUPERCOMPUTING | 2016 / Vol. 72 / Issue 08
Keywords
High-dimensional data; Incomplete data imputation; Feature subset selection; Clustering analysis; SUPPORT VECTOR REGRESSION; C-MEANS; VALUES
DOI
10.1007/s11227-015-1433-9
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Incomplete data imputation plays an important role in big data analysis and smart computing. Existing algorithms are neither efficient nor effective at imputing incomplete high-dimensional data. The paper proposes an incomplete high-dimensional data imputation algorithm based on feature selection and cluster analysis (IHDIFC), which works in three steps. First, a hierarchical clustering-based feature subset selection algorithm is designed to reduce the dimensionality of the data set. Second, a parallel k-means algorithm based on partial distance is derived to cluster the selected data subset efficiently. Finally, the data objects in the same cluster as the target object are used to estimate its missing feature values. Extensive experiments compare IHDIFC with two representative missing data imputation algorithms, FIMUS and DMI. The results demonstrate that the proposed algorithm achieves better imputation accuracy and takes significantly less time than the other two algorithms when imputing high-dimensional data.
Pages: 2977-2990
Number of pages: 14
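
The second and third steps named in the abstract (clustering with a partial distance, then estimating missing values from objects in the same cluster) can be illustrated with a minimal serial sketch. This is not the authors' IHDIFC implementation: the feature selection step and the parallelism are omitted, missing values are assumed to be encoded as `None`, the partial distance is assumed to be the squared distance over observed features rescaled to the full dimension, imputation is assumed to use the per-cluster feature mean, and all function names are hypothetical.

```python
def partial_distance(x, c):
    """Squared distance over the observed (non-None) features of x,
    rescaled by (total features / observed features)."""
    observed = [(a - b) ** 2 for a, b in zip(x, c) if a is not None]
    if not observed:
        return float("inf")  # a row with no observed feature matches nothing
    return sum(observed) * len(x) / len(observed)


def kmeans_partial(data, init_centroids, iters=10):
    """Serial k-means that tolerates None entries via partial_distance.
    Initial centroids are supplied by the caller (an assumption of this sketch)."""
    centroids = [list(c) for c in init_centroids]
    k, dim = len(centroids), len(data[0])
    labels = [0] * len(data)
    for _ in range(iters):
        # Assignment step: nearest centroid under the partial distance.
        labels = [
            min(range(k), key=lambda j: partial_distance(x, centroids[j]))
            for x in data
        ]
        # Update step: each centroid feature is the mean of the observed values.
        for j in range(k):
            for f in range(dim):
                vals = [x[f] for x, lab in zip(data, labels)
                        if lab == j and x[f] is not None]
                if vals:
                    centroids[j][f] = sum(vals) / len(vals)
    return labels, centroids


def impute(data, labels, centroids):
    """Replace each None with the corresponding feature of the row's cluster centroid."""
    return [
        [centroids[lab][f] if v is None else v for f, v in enumerate(row)]
        for row, lab in zip(data, labels)
    ]


# Toy data: two well-separated groups, each with one incomplete row.
data = [[1.0, 1.0], [1.2, 0.8], [0.8, None],
        [10.0, 10.0], [10.2, 9.8], [None, 10.2]]
labels, centroids = kmeans_partial(data, [[0.0, 0.0], [10.0, 10.0]])
filled = impute(data, labels, centroids)
```

Because the centroids are updated from observed values only, an incomplete row never contributes a guessed value to its cluster; it only receives one at the imputation stage.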