An LSH-based k-representatives clustering method for large categorical data

Cited by: 7
Authors
Mau, Toan Nguyen [1]
Huynh, Van-Nam [1]
Affiliation
[1] Japan Adv Inst Sci & Technol, Sch Adv Sci & Technol, Nomi, Ishikawa, Japan
Keywords
Categorical data; Clustering; Dissimilarity measure; k-Means like algorithm; Locality-Sensitive Hashing; Modes algorithm; Prototypes
DOI
10.1016/j.neucom.2021.08.050
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Clustering categorical data remains a challenging problem in the era of big data, both because dis/similarity is difficult to measure meaningfully for categorical data and because the high computational complexity of existing clustering algorithms makes them hard to apply in practical big data mining applications. In this paper, we propose an integrated approach that incorporates the Locality-Sensitive Hashing (LSH) technique into k-means-like clustering so that it can predict better initial clusters and thereby boost clustering effectiveness. To this end, we first use a data-driven dissimilarity measure for categorical data to construct a family of binary hash functions, which are then used to generate the initial clusters. We also propose using a nearest neighbor search at each iteration to reassign data objects to clusters, which reduces the computational cost of clustering. These solutions are incorporated into the k-representatives algorithm, resulting in the so-called LSH-k-representatives algorithm. Extensive experiments on multiple real-world and synthetic datasets demonstrate the effectiveness of the proposed method. The newly developed algorithm yields comparable or better clustering results than closely related existing works, yet it is significantly more efficient, by a factor of between 2x and 32x. © 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
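The pipeline outlined in the abstract (binary hashes to seed clusters, then k-representatives assignment) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several parts are assumptions made for illustration: each hash bit here simply tests whether an attribute takes its most frequent value, standing in for the paper's dissimilarity-driven construction; the function names (make_bit_hashes, initial_clusters, representative, dissimilarity, assign) are hypothetical; and reassignment scans every cluster rather than using the nearest neighbor search the abstract describes. The representative and dissimilarity follow the usual k-representatives idea of per-attribute relative frequencies.

```python
from collections import Counter, defaultdict

def make_bit_hashes(data, attrs):
    # ASSUMPTION: one bit per attribute, set when the value equals the
    # attribute's most frequent value. The paper instead derives the split
    # from a data-driven dissimilarity measure.
    return [(j, Counter(row[j] for row in data).most_common(1)[0][0])
            for j in attrs]

def hash_key(row, hashes):
    # Concatenate the bits into a bucket key.
    return tuple(int(row[j] == v) for j, v in hashes)

def initial_clusters(data, hashes, k):
    # Objects sharing a key share a bucket; keep the k largest buckets.
    buckets = defaultdict(list)
    for i, row in enumerate(data):
        buckets[hash_key(row, hashes)].append(i)
    return sorted(buckets.values(), key=len, reverse=True)[:k]

def representative(data, members):
    # Per attribute, the relative frequency of each value in the cluster.
    rep = []
    for j in range(len(data[0])):
        counts = Counter(data[i][j] for i in members)
        rep.append({v: c / len(members) for v, c in counts.items()})
    return rep

def dissimilarity(row, rep):
    # Sum over attributes of (1 - relative frequency of the object's value).
    return sum(1.0 - freq.get(value, 0.0) for value, freq in zip(row, rep))

def assign(data, reps):
    # Full scan over clusters (no nearest neighbor index in this sketch).
    return [min(range(len(reps)), key=lambda c: dissimilarity(row, reps[c]))
            for row in data]

data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "small", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
]
hashes = make_bit_hashes(data, attrs=[0, 1])
clusters = initial_clusters(data, hashes, k=2)
reps = [representative(data, members) for members in clusters]
labels = assign(data, reps)
```

In a full algorithm, the representative update and the assignment step would alternate until the labels stop changing; the sketch shows a single pass seeded by the hash buckets.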
Pages: 29-44
Page count: 16