Feature selection for k-means clustering stability: theoretical analysis and an algorithm

Cited by: 12
Authors
Mavroeidis, Dimitrios [1]
Marchiori, Elena [2]
Affiliations
[1] IBM Res Ireland, Dublin 15, Ireland
[2] Radboud Univ Nijmegen, Dept Comp Sci, Fac Sci, NL-6525 AJ Nijmegen, Netherlands
Keywords
Sparse PCA; Stability; Feature selection; Clustering
DOI
10.1007/s10618-013-0320-3
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Stability of a learning algorithm with respect to small input perturbations is an important property, as it implies that the derived models are robust with respect to the presence of noisy features and/or data sample fluctuations. The qualitative nature of the stability property complicates the development of practical, stability-optimizing data mining algorithms, as several issues naturally arise, such as how "much" stability is enough, or how stability can be effectively associated with intrinsic data properties. In this work we take these issues into account and explore the effect of stability maximization in the continuous (PCA-based) k-means clustering problem. Our analysis is based on both mathematical optimization and statistical arguments that complement each other and allow for a solid interpretation of the algorithm's stability properties. Interestingly, we derive that stability maximization naturally introduces a tradeoff between cluster separation and variance, leading to the selection of features that have a high cluster separation index that is not artificially inflated by the features' variance. The proposed algorithmic setup is based on a Sparse PCA approach that selects the features that maximize stability in a greedy fashion. In our study, we also analyze several properties of Sparse PCA relevant to stability that promote Sparse PCA as a viable feature selection mechanism for clustering. The practical relevance of the proposed method is demonstrated in the context of cancer research, where we consider the problem of detecting potential tumor biomarkers using microarray gene expression data. The application of our method to a leukemia dataset shows that the tradeoff between cluster separation and variance leads to the selection of features corresponding to important biomarker genes. Some of them have relatively low variance and are not detected without the direct optimization of stability in Sparse PCA based k-means.
Apart from the qualitative evaluation, we have also verified our approach as a feature selection method for k-means clustering using four cancer research datasets. The quantitative empirical results illustrate the practical utility of our framework as a feature selection mechanism for clustering.
Pages: 918-960
Page count: 43