A stratified sampling based clustering algorithm for large-scale data

Cited by: 34
Authors
Zhao, Xingwang [1, 2]
Liang, Jiye [1 ]
Dang, Chuangyin [2 ]
Affiliations
[1] Shanxi Univ, Sch Comp & Informat Technol, Key Lab Computat Intelligence & Chinese Informat, Minist Educ, Taiyuan 030006, Shanxi, Peoples R China
[2] City Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Peoples R China
Keywords
Large-scale data; Fuzzy c-means algorithm; Stratified sampling; Data labeling
DOI
10.1016/j.knosys.2018.09.007
Chinese Library Classification number
TP18 [Theory of artificial intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large-scale data analysis is a challenging and relevant task for present-day research and industry. As a promising data analysis tool, clustering is becoming more important in the era of big data. In large-scale data clustering, sampling is an efficient and widely used approximation technique. Recently, several sampling-based clustering algorithms have attracted considerable attention in large-scale data analysis owing to their efficiency. However, some of these algorithms have low clustering accuracy, whereas others have high computational complexity. To overcome these deficiencies, this paper proposes a stratified sampling based clustering algorithm for large-scale data. Its basic steps are: (1) obtaining a number of representative samples from different strata, which are formed by a locality-sensitive hashing technique, using a stratified sampling scheme; (2) partitioning the chosen samples into clusters using the fuzzy c-means clustering algorithm; and (3) assigning the out-of-sample objects to their closest clusters via a data labeling technique. The proposed algorithm is compared with state-of-the-art sampling-based fuzzy c-means clustering algorithms on several large-scale data sets, both synthetic and real. The experimental results show that the proposed algorithm outperforms the related algorithms in terms of clustering quality and computational efficiency on large-scale data sets. © 2018 Published by Elsevier B.V.
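The three steps listed in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not specify the hash family, the per-stratum sample sizes, or the labeling rule, so the sketch assumes random-hyperplane hashing, proportional allocation, and nearest-centroid labeling. All function names and parameters (`lsh_strata`, `stratified_sample`, `fuzzy_c_means`, `label_nearest`, `frac`, `n_bits`) are hypothetical.

```python
import numpy as np

def lsh_strata(X, n_bits=4, seed=0):
    """Step 1a (assumed): bucket rows with random-hyperplane LSH sign bits."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_bits))
    bits = ((X - X.mean(axis=0)) @ planes) > 0          # (n, n_bits) booleans
    return bits.astype(np.int64) @ (1 << np.arange(n_bits))  # bucket id per row

def stratified_sample(strata, frac=0.1, seed=0):
    """Step 1b (assumed): draw the same fraction from every stratum."""
    rng = np.random.default_rng(seed)
    chosen = []
    for s in np.unique(strata):
        members = np.flatnonzero(strata == s)
        k = max(1, int(round(frac * members.size)))     # at least one per stratum
        chosen.append(rng.choice(members, size=k, replace=False))
    return np.sort(np.concatenate(chosen))

def memberships(X, centers, m):
    """Standard fuzzy c-means membership matrix; each row sums to 1."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + 1e-12
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Step 2: plain fuzzy c-means on the sampled objects."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(X.shape[0], size=c, replace=False)]
    for _ in range(n_iter):
        W = memberships(X, centers, m) ** m
        new_centers = (W.T @ X) / W.sum(axis=0)[:, None]
        converged = np.abs(new_centers - centers).max() < tol
        centers = new_centers
        if converged:
            break
    return centers, memberships(X, centers, m)

def label_nearest(X, centers):
    """Step 3 (assumed): label every object by its nearest cluster center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def stratified_fcm(X, c, frac=0.1, n_bits=4, seed=0):
    """Chain the three steps: strata -> sample -> FCM -> labeling."""
    strata = lsh_strata(X, n_bits, seed)
    idx = stratified_sample(strata, frac, seed)
    centers, _ = fuzzy_c_means(X[idx], c, seed=seed)
    return label_nearest(X, centers), centers, idx

# Toy usage on two well-separated synthetic blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (500, 2)), rng.normal(5, 0.5, (500, 2))])
labels, centers, idx = stratified_fcm(X, c=2)
```

Only the sampled subset `X[idx]` goes through the iterative fuzzy c-means updates; the remaining objects need a single distance computation against the final centers, which is where a sampling-based scheme of this kind would save time on large data sets.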
Pages: 416-428
Page count: 13
Related papers
50 in total
  • [1] A Sampling-Based Density Peaks Clustering Algorithm for Large-Scale Data
    Ding, Shifei
    Li, Chao
    Xu, Xiao
    Ding, Ling
    Zhang, Jian
    Guo, Lili
    Shi, Tianhao
    [J]. PATTERN RECOGNITION, 2023, 136
  • [2] A Sampling-Based Graph Clustering Algorithm for Large-Scale Networks
    Zhang, Jian-Peng
    Chen, Hong-Chang
    Wang, Kai
    Zhu, Kai-Jie
    Wang, Ya-Wen
    [J]. Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2019, 47 (08) : 1731 - 1737
  • [3] CLUSTERING LARGE-SCALE DATA BASED ON MODIFIED AFFINITY PROPAGATION ALGORITHM
    Serdah, Ahmed M.
    Ashour, Wesam M.
    [J]. JOURNAL OF ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING RESEARCH, 2016, 6 (01) : 23 - 33
  • [4] Fuzzy clustering algorithm based on multiple medoids for large-scale data
    Chen, Ai-Guo
    Wang, Shi-Tong
    [J]. Kongzhi yu Juece/Control and Decision, 2016, 31 (12) : 2122 - 2130
  • [5] DSS: A Scalable and Efficient Stratified Sampling Algorithm for Large-Scale Datasets
    Li, Minne
    Li, Dongsheng
    Shen, Siqi
    Zhang, Zhaoning
    Lu, Xicheng
    [J]. NETWORK AND PARALLEL COMPUTING, 2016, 9966 : 133 - 146
  • [6] A Novel Clustering Algorithm on Large-Scale Graph Data
    Zhang, Hao
    Zhou, Wei
    Wan, Xiaoyu
    Fu, Ge
    Xu, Zhiyong
    Han, Jizhong
    [J]. 2014 INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND BIG DATA (CCBD), 2014 : 47 - 54
  • [7] MapReduce-based Dragonfly Algorithm for large-scale Data-Clustering
    Tripathi, Ashish Kumar
    Saxena, Pranav
    Gupta, Siddharth
    [J]. 2019 FIFTH INTERNATIONAL CONFERENCE ON IMAGE INFORMATION PROCESSING (ICIIP 2019), 2019 : 171 - 175
  • [8] Large-Scale Data Clustering Algorithm Based on Quantum Immune Regulation Network
    Li, Yangyang
    Bai, Xiaoyu
    Hou, Xiaoju
    Jiao, Licheng
    [J]. 2017 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2017
  • [9] Affinity propagation clustering algorithm based on large-scale data-set
    Wang, Limin
    Zheng, Kaiyue
    Tao, Xing
    Han, Xuming
    [J]. International Journal of Computers and Applications, 2018, 40 (03) : 1 - 6
  • [10] A study of large-scale data clustering based on fuzzy clustering
    Li, Yangyang
    Yang, Guoli
    He, Haiyang
    Jiao, Licheng
    Shang, Ronghua
    [J]. SOFT COMPUTING, 2016, 20 (08) : 3231 - 3242