Improving Clustering Efficiency by SimHash-based K-Means Algorithm for Big Data Analytics

Cited by: 0

Authors:

Wang, Jenq-Haur [1]

Lin, Jia-Zhi [1]

Affiliations:

[1] Natl Taipei Univ Technol, Dept Comp Sci & Informat Engn, Taipei, Taiwan

Source:

2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2016

Keywords:

Document Clustering; SimHash; K-Means; Dimension Reduction; Similarity Calculation;

DOI:

Not available

Chinese Library Classification (CLC):

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

The K-Means algorithm is one of the most popular methods for flat clustering, but its similarity calculations are time-consuming on big data, which lowers performance in practice. Previous studies proposed ways to find better initial centroids so that data points are assigned to suitable clusters with reduced time complexity. However, in a vector space representation, the dimensionality of the vector space grows as the data volume increases, so similarity calculation takes more time. In this paper, we propose a SimHash-based K-Means clustering algorithm that uses locality-sensitive hashing and dimensionality reduction to improve efficiency in big data analytics. The experimental results show that the proposed method greatly reduces the processing time of K-Means clustering without significantly affecting its effectiveness. Further investigation is needed to verify the performance on larger-scale data.
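
The abstract does not give implementation details, so the following is only a minimal sketch of the general SimHash idea it refers to, not the authors' method. It reduces each document to a fixed-width bit fingerprint and compares fingerprints by Hamming distance, which is far cheaper than comparing high-dimensional term vectors. The function names (`simhash`, `hamming`, `assign_to_nearest`), the use of MD5 as the token hash, the 64-bit width, and raw term frequency as the weight are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Return a `bits`-wide SimHash fingerprint for a list of tokens.

    Assumption: `bits` is a multiple of 8 and at most 128 (the MD5 digest size).
    """
    weights = [0] * bits
    for token, count in Counter(tokens).items():
        # Hash each distinct token to a stable integer (MD5 is an illustrative choice).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the term frequency if bit i of the hash is set, subtract it otherwise.
            weights[i] += count if (h >> i) & 1 else -count
    # Keep only the sign of each accumulated weight as one fingerprint bit.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def assign_to_nearest(fingerprints, centroids):
    """K-Means-style assignment step using Hamming distance on fingerprints."""
    return [
        min(range(len(centroids)), key=lambda c: hamming(fp, centroids[c]))
        for fp in fingerprints
    ]
```

In a K-Means-style loop, the assignment step above would replace the usual cosine or Euclidean comparison against each centroid; how centroids are updated in fingerprint space (for example, by a per-bit majority vote over cluster members) is a further design choice that the abstract does not specify.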


Pages: 1881-1888

Page count: 8
