Improving Clustering Efficiency by SimHash-based K-Means Algorithm for Big Data Analytics

Cited by: 0

Authors:

Wang, Jenq-Haur [1]

Lin, Jia-Zhi [1]

Affiliation:

[1] Natl Taipei Univ Technol, Dept Comp Sci & Informat Engn, Taipei, Taiwan

Source:

2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2016

Keywords:

Document Clustering; SimHash; K-Means; Dimension Reduction; Similarity Calculation;

DOI:

Not available

Chinese Library Classification number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

The K-Means algorithm is one of the most popular methods for flat clustering, but its similarity calculation is time-consuming for big data, which lowers performance in practice. Previous studies proposed ways of finding better initial centroids so that data points are assigned to suitable clusters with reduced time complexity. However, in a vector space representation, the dimension of the vector space grows as the data volume increases, so similarity calculation takes more time. In this paper, we propose a SimHash-based K-Means clustering algorithm that uses locality-sensitive hashing and dimensionality reduction to improve efficiency in big data analytics. The experimental results show that the proposed method greatly reduces the processing time of K-Means clustering without significantly affecting its effectiveness. Further investigation is needed to verify the performance on larger-scale data.
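
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea: documents are reduced to fixed-length SimHash fingerprints, and K-Means-style clustering then compares fingerprints by Hamming distance instead of comparing high-dimensional term vectors. The 64-bit fingerprint length, the term-frequency weighting, the MD5-based token hash, the bitwise-majority centroid update, and the naive "first k points" initialization are all assumptions made for illustration and are not taken from the paper.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint length (assumed; not specified in the abstract)


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token: the first 8 bytes of its MD5 digest."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """SimHash fingerprint built by term-frequency-weighted bit voting."""
    votes = [0] * BITS
    for token, weight in Counter(text.lower().split()).items():
        h = token_hash(token)
        for i in range(BITS):
            votes[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def majority_centroid(members):
    """Centroid whose bit i is set when more than half the members set it."""
    half = len(members) / 2
    return sum(
        1 << i
        for i in range(BITS)
        if sum((m >> i) & 1 for m in members) > half
    )


def kmeans_hamming(fingerprints, k, iterations=10):
    """K-Means-style loop over fingerprints using Hamming distance."""
    centroids = list(fingerprints[:k])  # naive initialization (assumption)
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by Hamming distance.
        labels = [
            min(range(k), key=lambda c: hamming(fp, centroids[c]))
            for fp in fingerprints
        ]
        # Update step: bitwise majority vote; empty clusters keep their centroid.
        updated = []
        for c in range(k):
            members = [fp for fp, lab in zip(fingerprints, labels) if lab == c]
            updated.append(majority_centroid(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids
```

In this sketch each distance computation touches a single 64-bit integer, independent of vocabulary size, which is the kind of saving the abstract attributes to dimensionality reduction. A typical call would be `kmeans_hamming([simhash(d) for d in docs], k)`, where `docs` is a hypothetical list of document strings.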

Pages: 1881-1888

Page count: 8

50 records in total

[41] Data clustering using K-Means based on Crow Search Algorithm
Lakshmi, K.
Visalakshi, N. Karthikeyani
Shanthi, S.
[J]. SADHANA-ACADEMY PROCEEDINGS IN ENGINEERING SCIENCES, 2018, 43 (11):
[42] Enhanced Data Lake Clustering Design based on K-means Algorithm
Kachaoui, Jabrane
Belangour, Abdessamad
[J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (04) : 547 - 554
[43] The SKM Algorithm: A K-Means Algorithm for Clustering Sequential Data
Dias, Jose G.
Cortinhal, Maria Joao
[J]. ADVANCES IN ARTIFICIAL INTELLIGENCE - IBERAMIA 2008, PROCEEDINGS, 2008, 5290 : 173 - 182
[44] Cloud-based Educational Big Data Application of Apriori algorithm and K-Means Clustering algorithm based on Students' Information
Yi, Jiaqu
Li, Sizhe
Wu, Maomao
Yeung, H. H. Au
Fok, Wilton W. T.
Wang, Ying
Liu, Fang
[J]. 2014 IEEE FOURTH INTERNATIONAL CONFERENCE ON BIG DATA AND CLOUD COMPUTING (BDCLOUD), 2014, : 151 - 158
[45] Performance Enhancement of Distributed K-Means Clustering for Big Data Analytics Through In-memory Computation
Ketu, Shwet
Agarwal, Sonali
[J]. 2015 EIGHTH INTERNATIONAL CONFERENCE ON CONTEMPORARY COMPUTING (IC3), 2015, : 318 - 324
[46] An Improvement to the K-means Algorithm Oriented to Big Data
Perez Ortega, Joaquin
Rodolfo Pazos, R.
Hidalgo, Miguel
Almanza, Nelva
Diaz-Parra, Ocotlan
Santaolaya, Rene
Caballero, Vitervo
[J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE OF NUMERICAL ANALYSIS AND APPLIED MATHEMATICS 2014 (ICNAAM-2014), 2015, 1648
[47] An efficient K-means clustering algorithm for tall data
Capo, Marco
Perez, Aritz
Lozano, Jose A.
[J]. DATA MINING AND KNOWLEDGE DISCOVERY, 2020, 34 (03) : 776 - 811
[48] An efficient K-means clustering algorithm for tall data
Marco Capó
Aritz Pérez
Jose A. Lozano
[J]. Data Mining and Knowledge Discovery, 2020, 34 : 776 - 811
[49] An extension of the K-means algorithm to clustering skewed data
Volodymyr Melnykov
Xuwen Zhu
[J]. Computational Statistics, 2019, 34 : 373 - 394
[50] Parallelization of K-Means Clustering Algorithm for Data Mining
Jiang, Hao
Yu, Liyan
[J]. 4TH ANNUAL INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND APPLICATIONS (ITA 2017), 2017, 12