Improving Clustering Efficiency by SimHash-based K-Means Algorithm for Big Data Analytics

Authors
Wang, Jenq-Haur [1]
Lin, Jia-Zhi [1]
Affiliations
[1] Natl Taipei Univ Technol, Dept Comp Sci & Informat Engn, Taipei, Taiwan
Keywords
Document Clustering; SimHash; K-Means; Dimension Reduction; Similarity Calculation;
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The K-Means algorithm is one of the most popular methods for flat clustering, but its similarity calculation is time-consuming for big data, which lowers performance in practice. Previous studies proposed ways of finding better initial centroids so that data points are assigned to suitable clusters effectively and with reduced time complexity. However, in a vector space representation, the dimension of the vector space grows as the data volume increases, which makes similarity calculation slower. In this paper, we propose a SimHash-based K-Means clustering algorithm that uses locality-sensitive hashing and dimensionality reduction to improve efficiency in big data analytics. The experimental results show that the proposed method greatly reduces the processing time of K-Means clustering without significantly affecting its effectiveness. Further investigation is needed to verify the performance on larger-scale data.
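The abstract does not give implementation details, so the following is only a minimal sketch of how SimHash fingerprints could stand in for high-dimensional term vectors in a K-Means-style loop. It is not the authors' implementation. The MD5 token hash, the term-frequency weights, the Hamming-distance assignment, the bitwise majority-vote centroid update, and all function names are assumptions made for illustration.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Reduce a token list to a `bits`-bit fingerprint (term-frequency weights assumed)."""
    totals = [0] * bits
    for token, weight in Counter(tokens).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for i in range(bits):
            # Each token votes +weight or -weight on every bit position.
            totals[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if totals[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def majority_centroid(members, bits):
    """Hypothetical centroid update: bitwise majority vote over the members."""
    centroid = 0
    for i in range(bits):
        ones = sum((m >> i) & 1 for m in members)
        if 2 * ones > len(members):
            centroid |= 1 << i
    return centroid


def simhash_kmeans(fingerprints, k, bits=64, max_iter=20):
    """K-Means-style loop over fingerprints using Hamming distance."""
    # Assumed initialisation: the first k distinct fingerprints.
    centroids = list(dict.fromkeys(fingerprints))[:k]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda c: hamming(fp, centroids[c]))
            for fp in fingerprints
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(len(centroids)):
            members = [fp for fp, lab in zip(fingerprints, labels) if lab == c]
            if members:
                centroids[c] = majority_centroid(members, bits)
    return labels, centroids


docs = ["big data clustering", "big data clustering", "protein folding energy"]
fps = [simhash(d.split()) for d in docs]
labels, centroids = simhash_kmeans(fps, k=2)
```

In this sketch each distance computation is an XOR and a bit count on a fixed-width integer, regardless of vocabulary size. This is the kind of saving that dimensionality reduction through hashing is meant to provide.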
Pages: 1881-1888
Number of pages: 8