A method for k-means-like clustering of categorical data

Cited by: 23
Authors
Nguyen T.-H.T. [1]
Dinh D.-T. [1]
Sriboonchitta S. [2]
Huynh V.-N. [1]
Affiliations
[1] Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Ishikawa, Nomi
[2] Faculty of Economics, Centre of Excellence in Econometrics, Chiang Mai University, Chiang Mai
Keywords
Categorical data; Cluster analysis; Clustering; Dissimilarity measures; k-Means
DOI
10.1007/s12652-019-01445-5
Abstract
Despite recent efforts, clustering categorical and mixed data in the context of big data remains a challenge, owing to the lack of an inherently meaningful measure of similarity between categorical objects and to the high computational complexity of existing clustering techniques. While the k-means method is well known for its efficiency in clustering large data sets, it works only on numerical data, which prevents it from being applied to categorical data. In this paper, we aim to develop a novel extension of the k-means method for clustering categorical data, making use of an information theoretic-based dissimilarity measure and a kernel-based method for representing cluster means for categorical objects. Such an approach allows us to formulate the problem of clustering categorical data in a fashion similar to k-means clustering, while the kernel-based definition of centers also provides an interpretation of cluster means that is consistent with the statistical interpretation of cluster means for numerical data. To demonstrate the performance of the new clustering method, a series of experiments on real datasets from the UCI Machine Learning Repository is conducted, and the results are compared with those of several previously developed algorithms for clustering categorical data. © 2019, Springer-Verlag GmbH Germany, part of Springer Nature.
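The abstract names the two ingredients of the method but not their formulas, so the following is only a minimal sketch of how such a k-means-like loop could be put together, not the authors' implementation. It assumes a Lin-style information-theoretic dissimilarity between category values and an Aitchison-Aitken-style kernel for smoothing per-attribute frequencies into a cluster center; both choices, the bandwidth `lam`, and all function names are hypothetical illustrations.

```python
import math
import random
from collections import Counter


def value_probs(data):
    """Relative frequency of each category, per attribute, over the whole data set."""
    n = len(data)
    return [{v: c / n for v, c in Counter(col).items()} for col in zip(*data)]


def value_dissim(p, a, b):
    """Assumed Lin-style dissimilarity between two values of one attribute."""
    if a == b:
        return 0.0
    denom = math.log(p[a]) + math.log(p[b])      # negative: both values occur, so p < 1
    joint = min(1.0, p[a] + p[b])                # guard against float round-off above 1
    return 1.0 - 2.0 * math.log(joint) / denom


def kernel_center(cluster, probs, lam):
    """Center = one kernel-smoothed probability distribution per attribute (assumed form)."""
    center = []
    for d, col in enumerate(zip(*cluster)):
        freq = Counter(col)
        m = len(probs[d])                        # number of categories of attribute d
        dist = {}
        for v in probs[d]:
            f = freq[v] / len(col)
            dist[v] = f if m == 1 else (1 - lam) * f + lam / (m - 1) * (1 - f)
        center.append(dist)
    return center


def object_center_dissim(x, center, probs):
    """Expected value-level dissimilarity between object x and a center, summed over attributes."""
    return sum(
        w * value_dissim(probs[d], x[d], v)
        for d, dist in enumerate(center)
        for v, w in dist.items()
    )


def cluster_categorical(data, k, lam=0.1, iters=20, seed=0):
    """k-means-like loop: assign to the nearest center, then re-estimate centers."""
    rng = random.Random(seed)
    probs = value_probs(data)
    # Initialise from k distinct objects; requires k <= number of distinct objects.
    seeds = rng.sample(sorted(set(data)), k)
    centers = [kernel_center([x], probs, lam) for x in seeds]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: object_center_dissim(x, centers[j], probs))
            for x in data
        ]
        if new_labels == labels:                 # assignments stable: stop
            break
        labels = new_labels
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:                          # keep the old center if a cluster empties
                centers[j] = kernel_center(members, probs, lam)
    return labels, centers
```

Calling `cluster_categorical(rows, k)` on a list of equal-length tuples of category labels returns one cluster index per row together with the centers. Because each center is a probability distribution per attribute rather than a single mode, it plays a role analogous to the mean of a numerical cluster, which is the interpretation the abstract emphasizes.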
Pages: 15011 - 15021
Number of pages: 10
Related papers
50 in total
  • [1] A K-means-like algorithm for informetric data clustering
    Cena, Anna
    Gagolewski, Marek
    [J]. PROCEEDINGS OF THE 2015 CONFERENCE OF THE INTERNATIONAL FUZZY SYSTEMS ASSOCIATION AND THE EUROPEAN SOCIETY FOR FUZZY LOGIC AND TECHNOLOGY, 2015, 89 : 536 - 543
  • [2] A k-Means-Like Algorithm for Clustering Categorical Data Using an Information Theoretic-Based Dissimilarity Measure
    Thu-Hien Thi Nguyen
    Van-Nam Huynh
    [J]. FOUNDATIONS OF INFORMATION AND KNOWLEDGE SYSTEMS (FOIKS 2016), 2016, 9616 : 115 - 130
  • [3] FAST K-MEANS-LIKE CLUSTERING IN METRIC-SPACES
    JUAN, A
    VIDAL, E
    [J]. PATTERN RECOGNITION LETTERS, 1994, 15 (01) : 19 - 25
  • [4] K-Means Extensions for Clustering Categorical Data
    Alwersh, Mohammed
    Kovacs, Laszlo
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2023, 14 (09) : 492 - 507
  • [5] A modified K-means algorithm for categorical data clustering
    Sun, Y
    Zhu, QM
    Chen, ZX
    [J]. IC-AI'2000: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 1-III, 2000, : 31 - 37
  • [6] Extensions to the k-means algorithm for clustering large data sets with categorical values
    Huang, ZX
    [J]. DATA MINING AND KNOWLEDGE DISCOVERY, 1998, 2 (03) : 283 - 304
  • [7] Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
    Zhexue Huang
    [J]. Data Mining and Knowledge Discovery, 1998, 2 : 283 - 304
  • [8] A data labeling method for clustering categorical data
    Cao, Fuyuan
    Liang, Jiye
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (03) : 2381 - 2385
  • [9] Data Reduction Method for Categorical Data Clustering
    Rendon, Erendira
    Salvador Sanchez, J.
    Garcia, Rene A.
    Abundez, Itzel
    Gutierrez, Citlalih
    Gasca, Eduardo
    [J]. ADVANCES IN ARTIFICIAL INTELLIGENCE - IBERAMIA 2008, PROCEEDINGS, 2008, 5290 : 143 - +
  • [10] A Clustering Method for Categorical Ordinal Data
    Giordan, Marco
    Diana, Giancarlo
    [J]. COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 2011, 40 (07) : 1315 - 1334