Cluster Validation Method for Determining the Number of Clusters in Categorical Sequences

Cited by: 34
Authors
Guo, Gongde [1, 2]
Chen, Lifei [1, 2]
Ye, Yanfang [3]
Jiang, Qingshan [4]
Affiliations
[1] Fujian Normal Univ, Sch Math & Comp Sci, Fuzhou 350117, Fujian, Peoples R China
[2] Fujian Normal Univ, Fujian Prov Key Lab Network Secur & Cryptol, Fuzhou 350007, Fujian, Peoples R China
[3] W Virginia Univ, Lane Dept Comp Sci & Elect Engn, Morgantown, WV 26506 USA
[4] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen Key Lab High Performance Data Min, Shenzhen 518055, Peoples R China
Funding
National Science Foundation (USA); National Natural Science Foundation of China
Keywords
Categorical sequences; cluster validation; cluster validity index (CVI); data clustering; model selection; robust clustering; MODEL SELECTION; ALGORITHMS; NETWORKS;
DOI
10.1109/TNNLS.2016.2608354
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cluster validation, the process of evaluating the quality of clustering results, plays an important role in practical machine learning systems. Categorical sequences, such as biological sequences in computational biology, have become common in real-world applications. Unlike previous studies, which mainly focused on attribute-value data, this paper addresses the cluster validation problem for categorical sequences. Evaluating sequence clustering is currently difficult because no internal validation criterion is defined with regard to the structural features hidden in sequences. To solve this problem, a novel cluster validity index (CVI) is proposed as a function of clustering, with the intracluster structural compactness and intercluster structural separation linearly combined to measure the quality of sequence clusters. A partition-based algorithm for robust clustering of categorical sequences is also proposed; it provides the new measure with high-quality clustering results through deterministic initialization and the elimination of noise clusters using an information-theoretic method. The new clustering algorithm and the CVI are then assembled within the common model selection procedure to determine the number of clusters in categorical sequence sets. A case study on commonly used protein sequences and experimental results on real-world sequence sets from different domains are given to demonstrate the performance of the proposed method.
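The model selection procedure mentioned in the abstract (cluster for each candidate number of clusters, score each result with a validity index, keep the best-scoring number) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' method: the bigram profile, the toy k-medoids routine, and the "separation minus compactness" index are all hypothetical stand-ins chosen for brevity, and every function name here is invented for the example. Noise-cluster elimination is not modeled.

```python
from collections import Counter


def bigram_profile(seq):
    """Relative frequencies of adjacent symbol pairs (stand-in structural feature)."""
    grams = Counter(zip(seq, seq[1:]))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.items()}


def distance(p, q):
    """Half the L1 distance between two profiles; lies in [0, 1]."""
    return 0.5 * sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def assign(profiles, medoids):
    """Label each profile with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: distance(p, profiles[medoids[c]]))
            for p in profiles]


def cluster(profiles, k, iters=20):
    """Toy k-medoids with deterministic farthest-first initialization (assumption)."""
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(len(profiles)), key=lambda i: min(
            distance(profiles[i], profiles[m]) for m in medoids)))
    for _ in range(iters):
        labels = assign(profiles, medoids)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            # Keep the old medoid if a cluster happens to be empty.
            updated.append(min(members, key=lambda i: sum(
                distance(profiles[i], profiles[j]) for j in members))
                if members else medoids[c])
        if updated == medoids:
            break
        medoids = updated
    return assign(profiles, medoids), medoids


def validity(profiles, labels, medoids):
    """Illustrative index: separation minus compactness (higher is better)."""
    compactness = sum(distance(p, profiles[medoids[lab]])
                      for p, lab in zip(profiles, labels)) / len(profiles)
    separation = min(distance(profiles[a], profiles[b])
                     for i, a in enumerate(medoids) for b in medoids[i + 1:])
    return separation - compactness


def select_num_clusters(seqs, k_max):
    """Score k = 2..k_max and return the best k together with all scores."""
    profiles = [bigram_profile(s) for s in seqs]
    scores = {}
    for k in range(2, k_max + 1):
        labels, medoids = cluster(profiles, k)
        scores[k] = validity(profiles, labels, medoids)
    return max(scores, key=scores.get), scores


# Two visibly different groups of toy sequences.
seqs = ["ababababab", "abababab", "babababa", "cdcdcdcd", "dcdcdcdc", "cdcdcdcdcd"]
best_k, scores = select_num_clusters(seqs, k_max=4)
print(best_k, scores)
```

On this toy input the procedure selects two clusters, because splitting either group further makes the closest pair of medoids nearly identical and so lowers the separation term. The equal weighting of the two terms is an arbitrary choice for the sketch; the record does not state how the paper weights them.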
Pages: 2936-2948
Number of pages: 13
Related papers
50 in total
  • [1] A method of dynamically determining the number of clusters and cluster centers
    Shao Xiongkai
    Pi Ling
    Liu Lianzhou
    [J]. PROCEEDINGS OF THE 2013 8TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION (ICCSE 2013), 2013, : 283 - 286
  • [2] Determining the number of clusters in cluster analysis
    My-Young Cheong
    Hakbae Lee
    [J]. Journal of the Korean Statistical Society, 2008, 37 : 135 - 143
  • [3] Determining the number of clusters in cluster analysis
    Cheong, My-Young
    Lee, Hakbae
    [J]. JOURNAL OF THE KOREAN STATISTICAL SOCIETY, 2008, 37 (02) : 135 - 143
  • [4] DETERMINING THE OPTIMAL NUMBER OF CLUSTERS IN CLUSTER ANALYSIS
    Loster, Tomas
    [J]. 10TH INTERNATIONAL DAYS OF STATISTICS AND ECONOMICS, 2016, : 1078 - 1090
  • [5] An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data
    Bai, Liang
    Liang, Jiye
    Dang, Chuangyin
    [J]. KNOWLEDGE-BASED SYSTEMS, 2011, 24 (06) : 785 - 795
  • [6] A cluster validity evaluation method for dynamically determining the near-optimal number of clusters
    Li, Xiangjun
    Liang, Wei
    Zhang, Xinping
    Qing, Song
    Chang, Pei-Chann
    [J]. SOFT COMPUTING, 2020, 24 (12) : 9227 - 9241
  • [7] A cluster validity evaluation method for dynamically determining the near-optimal number of clusters
    Xiangjun Li
    Wei Liang
    Xinping Zhang
    Song Qing
    Pei-Chann Chang
    [J]. Soft Computing, 2020, 24 : 9227 - 9241
  • [8] ON CLUSTER VALIDATION FOR DETECTING THE NUMBER OF CLUSTERS IN A DATA SET
    Albalate, Amparo
    Suendermann, David
    Minker, Wolfgang
    [J]. INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS, 2011, 20 (05) : 941 - 953
  • [9] A Method for Automatically Determining The Number of Clusters of LAC
    Liu, Han
    Wu, Qingfeng
    Dong, Huailin
    Wang, Shuangshuang
    Cai, Qing
    Ma, Zhuo
    [J]. ICCSSE 2009: PROCEEDINGS OF 2009 4TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION, 2009, : 1907 - +
  • [10] A new validation index for determining the number of clusters in a data set
    Sun, HJ
    Wang, SG
    Jiang, QS
    [J]. IJCNN'01: INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, VOLS 1-4, PROCEEDINGS, 2001, : 1852 - 1857