Cluster Validation Method for Determining the Number of Clusters in Categorical Sequences

Cited by: 34
Authors
Guo, Gongde [1, 2]
Chen, Lifei [1, 2]
Ye, Yanfang [3]
Jiang, Qingshan [4]
Affiliations
[1] Fujian Normal Univ, Sch Math & Comp Sci, Fuzhou 350117, Fujian, Peoples R China
[2] Fujian Normal Univ, Fujian Prov Key Lab Network Secur & Cryptol, Fuzhou 350007, Fujian, Peoples R China
[3] W Virginia Univ, Lane Dept Comp Sci & Elect Engn, Morgantown, WV 26506 USA
[4] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen Key Lab High Performance Data Min, Shenzhen 518055, Peoples R China
Funding
National Science Foundation (USA); National Natural Science Foundation of China
Keywords
Categorical sequences; cluster validation; cluster validity index (CVI); data clustering; model selection; robust clustering; algorithms; networks
DOI
10.1109/TNNLS.2016.2608354
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Cluster validation, the process of evaluating the quality of clustering results, plays an important role in practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Unlike previous studies, which focused mainly on attribute-value data, this paper addresses the cluster validation problem for categorical sequences. Evaluating sequence clustering is currently difficult because no internal validation criterion has been defined with regard to the structural features hidden in sequences. To solve this problem, a novel cluster validity index (CVI) is proposed as a function of the clustering; it linearly combines intracluster structural compactness and intercluster structural separation to measure the quality of sequence clusters. A partition-based algorithm for robust clustering of categorical sequences is also proposed, which supplies the new measure with high-quality clustering results through deterministic initialization and the elimination of noise clusters using an information-theoretic method. The new clustering algorithm and the CVI are then combined within the common model selection procedure to determine the number of clusters in categorical sequence sets. A case study on commonly used protein sequences and experimental results on real-world sequence sets from different domains demonstrate the performance of the proposed method.
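The model selection procedure mentioned in the abstract (cluster the data for each candidate number of clusters, score each result with a validity index, keep the best-scoring number) can be sketched as below. This is only an illustration under stated assumptions, not the authors' method: the bigram-based Jaccard distance, the k-medoids clusterer with farthest-first initialization, and the index `separation - compactness` are stand-ins chosen for brevity, and every function name is hypothetical. The information-theoretic noise-cluster elimination described in the abstract is not modeled.

```python
from itertools import combinations

def bigrams(seq):
    """Set of adjacent symbol pairs; a crude stand-in for structural features."""
    return {seq[i:i + 2] for i in range(len(seq) - 1)}

def distance(a, b):
    """Jaccard distance between bigram sets (assumed measure, not the paper's)."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    return 1.0 - len(x & y) / len(union) if union else 0.0

def k_medoids(seqs, k, iters=20):
    """Deterministic k-medoids: farthest-first initialization, then refinement."""
    medoids = [0]
    while len(medoids) < k:
        # Add the sequence farthest from its nearest current medoid.
        medoids.append(max(
            range(len(seqs)),
            key=lambda i: min(distance(seqs[i], seqs[m]) for m in medoids)))
    labels = [0] * len(seqs)
    for _ in range(iters):
        # Assign each sequence to its nearest medoid.
        labels = [min(range(k), key=lambda c: distance(s, seqs[medoids[c]]))
                  for s in seqs]
        new = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:          # keep the old medoid for an empty cluster
                new.append(medoids[c])
                continue
            # The new medoid minimizes total distance to its cluster members.
            new.append(min(members, key=lambda i: sum(
                distance(seqs[i], seqs[j]) for j in members)))
        if new == medoids:
            break
        medoids = new
    return labels, medoids

def validity(seqs, labels, medoids):
    """Assumed index: minimum medoid separation minus mean compactness."""
    compact = sum(distance(s, seqs[medoids[lab]])
                  for s, lab in zip(seqs, labels)) / len(seqs)
    separation = min(distance(seqs[a], seqs[b])
                     for a, b in combinations(medoids, 2))
    return separation - compact

def select_k(seqs, k_range):
    """Score every candidate k (k >= 2) and return the best one with all scores."""
    scores = {}
    for k in k_range:
        labels, medoids = k_medoids(seqs, k)
        scores[k] = validity(seqs, labels, medoids)
    return max(scores, key=scores.get), scores

# Toy data: two groups built from disjoint alphabets.
seqs = ["ABABAB", "ABABAA", "BBABAB", "CDCDCD", "CDCDCC", "DDCDCD"]
best_k, scores = select_k(seqs, range(2, 5))
print(best_k, scores)
```

On this toy set the two groups share no bigrams, so the assumed index peaks at two clusters. Taking the minimum rather than the mean medoid separation is a deliberate choice in this sketch: once a natural group is split, two medoids sit close together and the score drops.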
Pages: 2936-2948
Page count: 13