Improving Clustering Efficiency by SimHash-based K-Means Algorithm for Big Data Analytics

Authors
Wang, Jenq-Haur [1]
Lin, Jia-Zhi [1]
Affiliations
[1] Natl Taipei Univ Technol, Dept Comp Sci & Informat Engn, Taipei, Taiwan
Keywords
Document Clustering; SimHash; K-Means; Dimension Reduction; Similarity Calculation
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
K-Means is one of the most popular methods for flat clustering, but its similarity calculations are time-consuming on big data, which lowers performance in practice. Previous studies proposed ways to find better initial centroids, so that data points are assigned to suitable clusters effectively and with reduced time complexity. However, in a vector space representation, the dimension of the vector space grows as the data volume increases, so each similarity calculation takes more time. In this paper, we propose a SimHash-based K-Means clustering algorithm that uses locality-sensitive hashing and dimensionality reduction to improve efficiency in big data analytics. The experimental results show that the proposed method greatly reduces the processing time of K-Means clustering without significantly affecting its effectiveness. Further investigation is needed to verify the performance on larger-scale data.
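
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: each document is reduced to a short SimHash fingerprint, and clustering then runs on the fingerprints rather than on high-dimensional term vectors. The 64-bit fingerprint length, the MD5 token hash, the Hamming distance, the bitwise-majority centroid, and all function names are assumptions made for illustration, not the authors' method.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint length; an assumption, not taken from the paper


def simhash(tokens, bits=BITS):
    """Return a `bits`-bit SimHash fingerprint of a token list as an int."""
    weights = [0] * bits
    for token, count in Counter(tokens).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the term frequency where the hash bit is 1, subtract where it is 0.
            weights[i] += count if (h >> i) & 1 else -count
    # Keep a 1 bit wherever the accumulated weight is positive.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def majority(fingerprints, bits=BITS):
    """Bitwise majority vote, used here as a stand-in for the cluster centroid."""
    half = len(fingerprints) / 2
    return sum(
        1 << i
        for i in range(bits)
        if sum((f >> i) & 1 for f in fingerprints) > half
    )


def kmeans_simhash(fingerprints, k, iterations=10):
    """Lloyd-style iterations over fingerprints using Hamming distance."""
    centroids = list(fingerprints[:k])  # naive initialisation (assumption)
    labels = [0] * len(fingerprints)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Hamming distance.
        labels = [
            min(range(k), key=lambda c: hamming(f, centroids[c]))
            for f in fingerprints
        ]
        # Update step: recompute each non-empty cluster's majority fingerprint.
        for c in range(k):
            members = [f for f, lab in zip(fingerprints, labels) if lab == c]
            if members:
                centroids[c] = majority(members)
    return labels
```

Under these assumptions, comparing two documents costs one XOR and a bit count on 64-bit integers, regardless of vocabulary size, which is the kind of saving in similarity calculation that the abstract describes. A call such as `kmeans_simhash([simhash(doc.split()) for doc in docs], k=3)` would return one cluster label per document.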
Pages: 1881-1888
Page count: 8