Automatically Determining the Number of Clusters in Unlabeled Data Sets

Cited by: 62
Authors
Wang, Liang [1]
Leckie, Christopher [1]
Ramamohanarao, Kotagiri [1]
Bezdek, James [1]
Affiliations
[1] Univ Melbourne, Dept Comp Sci & Software Engn, Parkville, Vic 3010, Australia
Funding
Australian Research Council;
Keywords
Clustering; cluster tendency; reordered dissimilarity image; VAT; VISUAL ASSESSMENT; TENDENCY; SELECTION; AID;
DOI
10.1109/TKDE.2008.158
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
One of the major problems in cluster analysis is the determination of the number of clusters in unlabeled data, which is a basic input for most clustering algorithms. In this paper, we investigate a new method called Dark Block Extraction (DBE) for automatically estimating the number of clusters in unlabeled data sets, which is based on an existing algorithm for Visual Assessment of Cluster Tendency (VAT) of a data set, using several common image and signal processing techniques. Its basic steps include 1) generating a VAT image of an input dissimilarity matrix, 2) performing image segmentation on the VAT image to obtain a binary image, followed by directional morphological filtering, 3) applying a distance transform to the filtered binary image and projecting the pixel values onto the main diagonal axis of the image to form a projection signal, and 4) smoothing the projection signal, computing its first-order derivative, and then detecting major peaks and valleys in the resulting signal to decide the number of clusters. Our DBE method is nearly "automatic," depending on just one easy-to-set parameter. Several numerical and real-world examples are presented to illustrate the effectiveness of DBE.
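As an illustration of step 1 only, the following is a minimal sketch of the standard VAT reordering, assuming a symmetric dissimilarity matrix with a zero diagonal. The function names `vat_order` and `vat_image` are hypothetical and are not taken from the paper; the later DBE steps (segmentation, morphological filtering, distance transform, projection, and peak detection) are not shown.

```python
import numpy as np

def vat_order(D):
    """Return a VAT-style ordering of the objects in dissimilarity matrix D.

    Assumes D is square, symmetric, and has a zero diagonal.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Start from one endpoint of the largest dissimilarity in D.
    i, _ = np.unravel_index(np.argmax(D), D.shape)
    order = [int(i)]
    remaining = np.ones(n, dtype=bool)
    remaining[i] = False
    # min_dist[j]: smallest dissimilarity from j to any already-ordered object.
    min_dist = D[i].copy()
    for _ in range(n - 1):
        # Pick the unordered object closest to the ordered set (Prim-like step).
        masked = np.where(remaining, min_dist, np.inf)
        j = int(np.argmin(masked))
        order.append(j)
        remaining[j] = False
        min_dist = np.minimum(min_dist, D[j])
    return np.array(order)

def vat_image(D):
    """Return the reordered dissimilarity matrix and the ordering used."""
    D = np.asarray(D, dtype=float)
    order = vat_order(D)
    return D[np.ix_(order, order)], order
```

When the reordered matrix is displayed as a grayscale image, compact clusters tend to appear as dark blocks along the main diagonal, which is the structure the remaining DBE steps are described as extracting.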
Pages: 335-350
Number of pages: 16