Learning-Based Dissimilarity for Clustering Categorical Data

Cited by: 4
Authors
Rivera Rios, Edgar Jacob [1 ]
Angel Medina-Perez, Miguel [1 ]
Lazo-Cortes, Manuel S. [2 ]
Monroy, Raul [1 ]
Affiliations
[1] Tecnol Monterrey, Sch Sci & Engn, Estado De Mexico 52926, Mexico
[2] TecNM Inst Tecnol Tlalnepantla, Tlalnepantla De Baz 54070, Mexico
Source
APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / Issue 08
Keywords
dissimilarity; categorical data; clustering
DOI
10.3390/app11083509
Chinese Library Classification (CLC) number
O6 [Chemistry]
Discipline classification code
0703
Abstract
Comparing data objects is at the heart of machine learning. For continuous data, object dissimilarity is usually taken to be object distance; however, for categorical data, there is no universal agreement, for categories can be ordered in several different ways. Most existing category dissimilarity measures characterize the distance among the values an attribute may take using precisely the number of different values the attribute takes (the attribute space) and the frequency at which they occur. These kinds of measures overlook attribute interdependence, which may provide valuable information when capturing per-attribute object dissimilarity. In this paper, we introduce a novel object dissimilarity measure that we call Learning-Based Dissimilarity, for comparing categorical data. Our measure characterizes the distance between two categorical values of a given attribute in terms of how likely it is that such values are confused or not when all the dataset objects with the remaining attributes are used to predict them. To that end, we provide an algorithm that, given a target attribute, first learns a classification model in order to compute a confusion matrix for the attribute. Then, our method transforms the confusion matrix into a per-attribute dissimilarity measure. We have successfully tested our measure against 55 datasets gathered from the University of California, Irvine (UCI) Machine Learning Repository. Our results show that it surpasses, in terms of various performance indicators for data clustering, the most prominent distance relations put forward in the literature.
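The pipeline described in the abstract (predict one attribute from the remaining ones, build a confusion matrix, turn it into a per-attribute dissimilarity) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the classifier is a small categorical naive Bayes model evaluated on its own training data, the confusion-to-dissimilarity transform is one minus the mean of the two row-normalized confusion rates, and objects are compared by summing the per-attribute values. The function names and the toy dataset are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def naive_bayes_predict(rows, target):
    """Fit a categorical naive Bayes model for column `target` using the
    remaining columns, then predict `target` for every row (assumed classifier)."""
    classes = sorted({r[target] for r in rows})
    others = [j for j in range(len(rows[0])) if j != target]
    prior = Counter(r[target] for r in rows)
    domain = {j: {r[j] for r in rows} for j in others}
    # cond[j][c][v] = number of rows of class c whose attribute j equals v
    cond = {j: defaultdict(Counter) for j in others}
    for r in rows:
        for j in others:
            cond[j][r[target]][r[j]] += 1

    def score(r, c):
        s = log(prior[c] / len(rows))
        for j in others:
            # Laplace smoothing keeps unseen value/class pairs finite
            s += log((cond[j][c][r[j]] + 1) / (prior[c] + len(domain[j])))
        return s

    return classes, [max(classes, key=lambda c: score(r, c)) for r in rows]


def attribute_dissimilarity(rows, target):
    """Map each pair of values of `target` to a dissimilarity in [0, 1].
    Assumed transform: values that are often confused are close."""
    classes, predicted = naive_bayes_predict(rows, target)
    confusion = {a: Counter() for a in classes}
    for r, p in zip(rows, predicted):
        confusion[r[target]][p] += 1
    rate = {
        a: {b: confusion[a][b] / sum(confusion[a].values()) for b in classes}
        for a in classes
    }
    return {
        (a, b): 0.0 if a == b else 1.0 - (rate[a][b] + rate[b][a]) / 2
        for a in classes
        for b in classes
    }


def object_dissimilarity(x, y, tables):
    """Sum the per-attribute dissimilarities (assumed aggregation)."""
    return sum(tables[j][(x[j], y[j])] for j in range(len(x)))


# Toy dataset: (colour, shape, taste); purely illustrative.
rows = [
    ("red", "round", "sweet"),
    ("red", "round", "sweet"),
    ("green", "round", "sour"),
    ("green", "round", "sour"),
    ("yellow", "long", "sweet"),
    ("yellow", "long", "sweet"),
]
tables = [attribute_dissimilarity(rows, j) for j in range(len(rows[0]))]
print(object_dissimilarity(rows[0], rows[2], tables))
```

The resulting pairwise object dissimilarities could then be passed to any clustering method that accepts a precomputed dissimilarity matrix. A held-out or cross-validated confusion matrix would likely be a more faithful choice than the resubstitution estimate used in this sketch.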
Pages: 17
Related papers
50 in total
  • [1] From Context to Distance: Learning Dissimilarity for Categorical Data Clustering
    Ienco, Dino
    Pensa, Ruggero G.
    Meo, Rosa
    [J]. ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2012, 6 (01)
  • [2] Context-Based Geodesic Dissimilarity Measure for Clustering Categorical Data
    Lee, Changki
    Jung, Uk
    [J]. APPLIED SCIENCES-BASEL, 2021, 11 (18)
  • [3] A Comparative Analysis of Dissimilarity Measures for Clustering Categorical Data
    Xavier-Junior, Joao C.
    Canuto, Anne M. P.
    Almeida, Noriedson D.
    Goncalves, Luiz M. G.
    [J]. 2013 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2013
  • [4] Clustering Categorical Data via Ensembling Dissimilarity Matrices
    Amiri, Saeid
    Clarke, Bertrand S.
    Clarke, Jennifer L.
    [J]. JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2018, 27 (01) : 195 - 208
  • [5] An effective dissimilarity measure for clustering of high-dimensional categorical data
    Lee, Jeonghoon
    Lee, Yoon-Joon
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2014, 38 (03) : 743 - 757
  • [6] An effective dissimilarity measure for clustering of high-dimensional categorical data
    Jeonghoon Lee
    Yoon-Joon Lee
    [J]. Knowledge and Information Systems, 2014, 38 : 743 - 757
  • [7] Context-Based Distance Learning for Categorical Data Clustering
    Ienco, Dino
    Pensa, Ruggero G.
    Meo, Rosa
    [J]. ADVANCES IN INTELLIGENT DATA ANALYSIS VIII, PROCEEDINGS, 2009, 5772 : 83 - 94
  • [8] Incremental learning based multiobjective fuzzy clustering for categorical data
    Saha, Indrajit
    Maulik, Ujjwal
    [J]. INFORMATION SCIENCES, 2014, 267 : 35 - 57
  • [9] An association-based dissimilarity measure for categorical data
    Le, SQ
    Ho, TB
    [J]. PATTERN RECOGNITION LETTERS, 2005, 26 (16) : 2549 - 2557
  • [10] Graph Enhanced Fuzzy Clustering for Categorical Data Using a Bayesian Dissimilarity Measure
    Zhang, Chuanbin
    Chen, Long
    Zhao, Yin-Ping
    Wang, Yingxu
    Chen, C. L. Philip
    [J]. IEEE TRANSACTIONS ON FUZZY SYSTEMS, 2023, 31 (03) : 810 - 824