Clustering categorical data in projected spaces

Cited by: 12
Author
Bouguessa, Mohamed [1]
Affiliation
[1] Univ Quebec, Dept Comp Sci, Montreal, PQ H3C 3P8, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
Projected clustering; Categorical data; High dimensions; Mixture model; HIGH-DIMENSIONAL DATA; DETECTION STRATEGY; OUTLIER DETECTION; BETA-MIXTURE; ALGORITHM;
DOI
10.1007/s10618-013-0336-8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The problem of clustering categorical data has been widely investigated and appropriate approaches have been proposed. However, the majority of the existing methods suffer from one or more of the following limitations: (1) difficulty detecting clusters of very low dimensionality embedded in high-dimensional spaces, (2) lack of an automatic mechanism for identifying relevant dimensions for each cluster, (3) lack of an outlier detection mechanism and (4) dependence on a set of parameters that need to be properly tuned. Most of the existing approaches are inadequate for dealing with these four issues in a unified framework. This motivates our effort to propose a fully automatic projected clustering algorithm for high-dimensional categorical data which is capable of facing the four aforementioned issues in a single framework. Our algorithm comprises two phases: (1) outlier handling and (2) clustering in projected spaces. The first phase of the algorithm is based on a probabilistic approach that exploits the beta mixture model to identify and eliminate outlier objects from a data set in a systematic way. In the second phase, the clustering process is based on a novel quality function that allows the identification of projected clusters of low dimensionality embedded in a high-dimensional space without any parameter setting by the user. The suitability of our proposal is demonstrated through empirical studies using synthetic and real data sets.
Pages: 3-38
Number of pages: 36
Related papers
  • [1] Clustering categorical data in projected spaces
    Mohamed Bouguessa
    [J]. Data Mining and Knowledge Discovery, 2015, 29 : 3 - 38
  • [2] Projected clustering for categorical datasets
    Kim, Minho
    Ramakrishna, R. S.
    [J]. PATTERN RECOGNITION LETTERS, 2006, 27 (12) : 1405 - 1417
  • [3] Possibilistic Projected Categorical Clustering via Cluster Cores
    Matthews, Stephen G.
    Martin, Trevor P.
    [J]. 2014 IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS (FUZZ-IEEE), 2014, : 2063 - 2070
  • [4] On data labeling for clustering categorical data
    Chen, Hung-Leng
    Chuang, Kun-Ta
    Chen, Ming-Syan
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2008, 20 (11) : 1458 - 1471
  • [5] Clustering categorical data streams
    He, Zengyou
    Xu, Xiaofei
    Deng, Shengchun
    Huang, Joshua Zhexue
    [J]. JOURNAL OF COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING, 2011, 11 (04) : 185 - 192
  • [6] Evaluation of Categorical Data Clustering
    Rezankova, Hana
    Loster, Tomas
    Husek, Dusan
    [J]. ADVANCES IN INTELLIGENT WEB MASTERING 3, 2011, 86 : 173 - 182
  • [7] Subtractive Clustering for Categorical Data
    Gu, Lei
    [J]. 2016 12TH INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (ICNC-FSKD), 2016, : 1229 - 1232
  • [8] Clustering Categorical Data: A Survey
    Naouali, Sami
    Ben Salem, Semeh
    Chtourou, Zied
    [J]. INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING, 2020, 19 (01) : 49 - 96
  • [9] A data labeling method for clustering categorical data
    Cao, Fuyuan
    Liang, Jiye
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (03) : 2381 - 2385
  • [10] Data Reduction Method for Categorical Data Clustering
    Rendon, Erendira
    Salvador Sanchez, J.
    Garcia, Rene A.
    Abundez, Itzel
    Gutierrez, Citlalih
    Gasca, Eduardo
    [J]. ADVANCES IN ARTIFICIAL INTELLIGENCE - IBERAMIA 2008, PROCEEDINGS, 2008, 5290 : 143 - +