Clustering Categorical Data via Ensembling Dissimilarity Matrices

Cited by: 13
Authors
Amiri, Saeid [1]
Clarke, Bertrand S. [2]
Clarke, Jennifer L. [2]
Affiliations
[1] Univ Wisconsin, Dept Nat & Appl Sci, Green Bay, WI 54302 USA
[2] Univ Nebraska, Dept Stat, Lincoln, NE 68588 USA
Keywords
Categorical data; Classification and clustering; Hamming distance; High-dimensional data; Sequence alignment; Stability; MULTIPLE SEQUENCE ALIGNMENT; K-MEANS ALGORITHM; SELECTION; SET;
DOI
10.1080/10618600.2017.1305278
Chinese Library Classification
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
We present a technique for clustering categorical data by generating many dissimilarity matrices and combining them. We begin by demonstrating our technique on low-dimensional categorical data and comparing it to several other techniques that have been proposed. We show through simulations and examples that our method is both more accurate and more stable. Then we give conditions under which our method should yield good results in general. Our method extends to high-dimensional categorical data of equal lengths by ensembling over many choices of explanatory variables. In this context, we compare our method with two other methods. Finally, we extend our method to high-dimensional categorical data vectors of unequal length by using alignment techniques to equalize the lengths. We give an example to show that our method continues to provide useful results, in particular, providing a comparison with phylogenetic trees. Supplementary material for this article is available online.
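The core idea in the abstract, building many dissimilarity matrices and combining them before clustering, can be illustrated with a minimal sketch. This is not the authors' published procedure: it assumes Hamming dissimilarity computed on random subsets of the variables, a simple average to combine the matrices, and naive average-linkage clustering of the result. The function names (`hamming`, `ensemble_dissimilarity`, `average_linkage`), the parameters, and the toy data are all hypothetical and chosen only for illustration.

```python
import random

def hamming(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def ensemble_dissimilarity(rows, n_matrices=50, subset_size=None, seed=0):
    """Average Hamming dissimilarity matrices built on random variable subsets.

    Assumption: plain averaging is used to combine the matrices.
    """
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    k = subset_size or max(1, p // 2)
    total = [[0.0] * n for _ in range(n)]
    for _ in range(n_matrices):
        cols = rng.sample(range(p), k)  # one random choice of variables
        for i in range(n):
            for j in range(i + 1, n):
                d = hamming([rows[i][c] for c in cols],
                            [rows[j][c] for c in cols])
                total[i][j] += d
                total[j][i] += d
    return [[v / n_matrices for v in row] for row in total]

def average_linkage(dist, n_clusters):
    """Naive agglomerative clustering with average linkage; returns labels."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the closest pair
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Toy categorical data: two groups drawn from disjoint alphabets.
data = [
    "AAAAAA", "AAAABA", "ABAAAA",
    "CCCCCC", "CCDCCC", "CCCCCD",
]
D = ensemble_dissimilarity(data, n_matrices=20, subset_size=4)
labels = average_linkage(D, n_clusters=2)
```

On this toy input the first three rows and the last three rows end up in separate clusters, because every variable subset separates the two alphabets completely. For sequences of unequal length, the abstract indicates that an alignment step would first be needed to equalize lengths; that step is not sketched here.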
Pages: 195-208
Page count: 14
Related papers
50 in total
  • [1] EnsCat: clustering of categorical data via ensembling
    Clarke, Bertrand S.
    Amiri, Saeid
    Clarke, Jennifer L.
    [J]. BMC BIOINFORMATICS, 2016, 17
  • [2] EnsCat: clustering of categorical data via ensembling
    Bertrand S. Clarke
    Saeid Amiri
    Jennifer L. Clarke
    [J]. BMC Bioinformatics, 17
  • [3] A Comparative Analysis of Dissimilarity Measures for Clustering Categorical Data
    Xavier-Junior, Joao C.
    Canuto, Anne M. P.
    Almeida, Noriedson D.
    Goncalves, Luiz M. G.
    [J]. 2013 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2013,
  • [4] Learning-Based Dissimilarity for Clustering Categorical Data
    Rivera Rios, Edgar Jacob
    Angel Medina-Perez, Miguel
    Lazo-Cortes, Manuel S.
    Monroy, Raul
    [J]. APPLIED SCIENCES-BASEL, 2021, 11 (08):
  • [5] From Context to Distance: Learning Dissimilarity for Categorical Data Clustering
    Ienco, Dino
    Pensa, Ruggero G.
    Meo, Rosa
    [J]. ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2012, 6 (01)
  • [6] Context-Based Geodesic Dissimilarity Measure for Clustering Categorical Data
    Lee, Changki
    Jung, Uk
    [J]. APPLIED SCIENCES-BASEL, 2021, 11 (18):
  • [7] An effective dissimilarity measure for clustering of high-dimensional categorical data
    Lee, Jeonghoon
    Lee, Yoon-Joon
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2014, 38 (03) : 743 - 757
  • [8] An effective dissimilarity measure for clustering of high-dimensional categorical data
    Jeonghoon Lee
    Yoon-Joon Lee
    [J]. Knowledge and Information Systems, 2014, 38 : 743 - 757
  • [9] Graph Enhanced Fuzzy Clustering for Categorical Data Using a Bayesian Dissimilarity Measure
    Zhang, Chuanbin
    Chen, Long
    Zhao, Yin-Ping
    Wang, Yingxu
    Chen, C. L. Philip
    [J]. IEEE TRANSACTIONS ON FUZZY SYSTEMS, 2023, 31 (03) : 810 - 824
  • [10] Clustering in Ordered Dissimilarity Data
    Havens, Timothy C.
    Bezdek, James C.
    Keller, James M.
    Popescu, Mihail
    [J]. INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2009, 24 (05) : 504 - 528