Stratified feature sampling method for ensemble clustering of high dimensional data

Cited by: 53
Authors
Jing, Liping [1]
Tian, Kuang [1]
Huang, Joshua Z. [2]
Affiliations
[1] Beijing Jiaotong Univ, Beijing Key Lab Traff Data Anal & Min, Beijing, Peoples R China
[2] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Stratified sampling; Ensemble clustering; High dimensional data; Consensus function; CLASS DISCOVERY; CLASSIFICATION; PREDICTION; CONSENSUS; SELECTION; CANCER;
DOI
10.1016/j.patcog.2015.05.006
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
High dimensional data with thousands of features present a big challenge to current clustering algorithms. Sparsity, noise and correlation of features are common characteristics of such data. Another common phenomenon is that clusters in such high dimensional data often exist in different subspaces. Ensemble clustering is emerging as a prominent technique for improving the robustness, stability and accuracy of high dimensional data clustering. In this paper, we propose a stratified sampling method for generating subspace component data sets in ensemble clustering of high dimensional data. Instead of randomly sampling a subset of features for each component data set, we first cluster the features of the high dimensional data into a few feature groups called feature strata. Using stratified sampling, we then randomly sample some features from each feature stratum and merge the sampled features from the different strata to generate a component data set. In this way, the component data sets better represent the clustering structure of the original data set. Compared with random sampling and random projection methods in synthetic data analysis, component clustering by stratified sampling increased the average clustering accuracy without sacrificing clustering diversity. We carried out a series of experiments on eight real world data sets from the microarray, text and image domains to evaluate ensemble clustering methods using three subspace component data generation methods and four consensus functions. The experimental results consistently showed that the stratified sampling method produced the best ensemble clustering results on all data sets. Ensemble clustering with stratified sampling also outperformed three other ensemble clustering methods that generate component clusters from the entire space of the original data. (C) 2015 Elsevier Ltd. All rights reserved.
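The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how features are clustered or how many features are drawn per stratum, so the plain k-means over feature columns, the `fraction` parameter, and all function names below are assumptions made for illustration only.

```python
import numpy as np


def cluster_features(X, n_strata, n_iter=20, seed=0):
    """Group the columns (features) of X into strata.

    Assumption: a plain k-means on feature vectors; the abstract does not
    specify the feature-clustering algorithm.
    """
    rng = np.random.default_rng(seed)
    F = X.T.astype(float)  # one row per feature
    start = rng.choice(F.shape[0], size=n_strata, replace=False)
    centers = F[start].copy()
    labels = np.zeros(F.shape[0], dtype=int)
    for _ in range(n_iter):
        # squared Euclidean distance from every feature to every center
        dists = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_strata):
            if np.any(labels == k):  # leave an empty stratum's center as is
                centers[k] = F[labels == k].mean(axis=0)
    return labels


def stratified_feature_sample(labels, fraction, rng):
    """Draw a share of features from every non-empty stratum and merge them."""
    chosen = []
    for k in np.unique(labels):
        members = np.flatnonzero(labels == k)
        # hypothetical rule: a fixed fraction per stratum, at least one feature
        n_pick = max(1, int(round(fraction * members.size)))
        chosen.append(rng.choice(members, size=n_pick, replace=False))
    return np.sort(np.concatenate(chosen))


def component_feature_indices(X, n_strata=3, n_components=5, fraction=0.3, seed=0):
    """Return the feature strata and one sampled feature-index set per component."""
    labels = cluster_features(X, n_strata, seed=seed)
    rng = np.random.default_rng(seed)
    index_sets = [
        stratified_feature_sample(labels, fraction, rng)
        for _ in range(n_components)
    ]
    return labels, index_sets


if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(40, 30))  # 40 objects, 30 features
    labels, index_sets = component_feature_indices(X)
    for idx in index_sets:
        component = X[:, idx]  # a component data set for one base clustering
        print(component.shape, sorted(set(labels[idx].tolist())))
```

Each component data set `X[:, idx]` would then be clustered separately and the resulting partitions combined by a consensus function; that stage is left out here because the abstract does not describe it in enough detail to sketch.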
Pages: 3688-3702
Number of pages: 15
Related papers
50 items in total
  • [1] Stratified Feature Sampling for Semi-Supervised Ensemble Clustering
    Tian, Jialin
    Ren, Yazhou
    Cheng, Xiang
    [J]. IEEE ACCESS, 2019, 7 : 128669 - 128675
  • [2] A Feature Grouping Method for Ensemble Clustering of High-Dimensional Genomic Big Data
    Farid, Dewan Md.
    Nowe, Ann
    Manderick, Bernard
    [J]. PROCEEDINGS OF 2016 FUTURE TECHNOLOGIES CONFERENCE (FTC), 2016, : 260 - 268
  • [3] Stratified sampling for feature subspace selection in random forests for high dimensional data
    Ye, Yunming
    Wu, Qingyao
    Huang, Joshua Zhexue
    Ng, Michael K.
    Li, Xutao
    [J]. PATTERN RECOGNITION, 2013, 46 (03) : 769 - 787
  • [4] Ensemble Method Using Correlation Based Feature Selection with Stratified Sampling for Classification
    Meshram, Shweta B.
    Shinde, Sharmila M.
    [J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON DATA ENGINEERING AND COMMUNICATION TECHNOLOGY, ICDECT 2016, VOL 1, 2017, 468 : 47 - 55
  • [5] Feature Selection for Clustering on High Dimensional Data
    Zeng, Hong
    Cheung, Yiu-ming
    [J]. PRICAI 2008: TRENDS IN ARTIFICIAL INTELLIGENCE, 2008, 5351 : 913 - 922
  • [6] Ensemble feature selection for high dimensional data: a new method and a comparative study
    Afef Ben Brahim
    Mohamed Limam
    [J]. Advances in Data Analysis and Classification, 2018, 12 : 937 - 952
  • [7] Ensemble feature selection for high dimensional data: a new method and a comparative study
    Ben Brahim, Afef
    Limam, Mohamed
    [J]. ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2018, 12 (04) : 937 - 952
  • [8] Ensemble Clustering of High Dimensional Data with FastMap Projection
    Khan, Imran
    Huang, Joshua Zhexue
    Nguyen Thanh Tung
    Williams, Graham
    [J]. TRENDS AND APPLICATIONS IN KNOWLEDGE DISCOVERY AND DATA MINING, 2014, 8643 : 483 - 493
  • [9] A feature group weighting method for subspace clustering of high-dimensional data
    Chen, Xiaojun
    Ye, Yunming
    Xu, Xiaofei
    Huang, Joshua Zhexue
    [J]. PATTERN RECOGNITION, 2012, 45 (01) : 434 - 446
  • [10] A New Ensemble Method with Feature Space Partitioning for High-Dimensional Data Classification
    Piao, Yongjun
    Piao, Minghao
    Jin, Cheng Hao
    Shon, Ho Sun
    Chung, Ji-Moon
    Hwang, Buhyun
    Ryu, Keun Ho
    [J]. MATHEMATICAL PROBLEMS IN ENGINEERING, 2015, 2015