Approximate Clustering Ensemble Method for Big Data

Cited by: 11
Authors:
Mahmud, Mohammad Sultan [1 ,2 ]
Huang, Joshua Zhexue [1 ,2 ]
Ruby, Rukhsana [3 ]
Ngueilbaye, Alladoumbaye [1 ,2 ]
Wu, Kaishun [1 ,2 ]
Affiliations:
[1] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China
[3] Guangdong Lab Artificial Intelligence & Digital Ec, Shenzhen 518107, Peoples R China
Funding: National Natural Science Foundation of China
Keywords:
Clustering approximation method; clustering ensemble; consensus functions; distributed clustering; RSP data model; K-MEANS; I-NICE; NUMBER; ALGORITHM; CONSENSUS; MODELS;
DOI:
10.1109/TBDATA.2023.3255003
Chinese Library Classification (CLC): TP [automation technology; computer technology]
Discipline code: 0812
Abstract:
Clustering a big distributed dataset of hundreds of gigabytes or more is a challenging task in distributed computing. A popular way to tackle this problem is to use a random sample of the big dataset to compute an approximate result as an estimate of the true result that would be computed from the entire dataset. In this paper, instead of using a single random sample, we use multiple random samples to compute an ensemble result as the estimate of the true result of the big dataset. We propose a distributed computing framework to compute the ensemble result. In this framework, a big dataset is represented in the RSP data model as random sample data blocks managed in a distributed file system. To compute the ensemble clustering result, a set of RSP data blocks is randomly selected as random samples and clustered independently in parallel on the nodes of a computing cluster to generate the component clustering results. The component results are transferred to the master node, which computes the ensemble result. Since the random samples are disjoint, traditional consensus functions cannot be used, so we propose two new methods to integrate the component clustering results into the final ensemble result. The first method uses the component cluster centers to build a graph and applies the METIS algorithm to cut the graph into subgraphs, from which a set of candidate cluster centers is found; a hierarchical clustering method then generates the final set of k cluster centers. The second method uses the clustering-by-passing-messages method to generate the final set of k cluster centers. Finally, the k-means algorithm allocates the entire dataset into k clusters. Experiments were conducted on both synthetic and real-world datasets. The results show that the new ensemble clustering methods performed better than the comparison methods and that the distributed computing framework is efficient and scalable in clustering big datasets.
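The sketch below illustrates the overall idea described in the abstract, not the authors' exact algorithms: disjoint random sample blocks stand in for RSP data blocks, each block is clustered independently (in the framework this would run in parallel on worker nodes), the pooled component centers are merged into k final centers on the "master" side, and one k-means pass allocates the whole dataset. The consensus step here is a plain agglomerative clustering of the pooled centers, a simplified stand-in for the paper's METIS graph-cut and message-passing consensus functions; the function name `ensemble_cluster`, the block count, and the scikit-learn usage are all assumptions for illustration.

```python
# Minimal sketch (assumptions noted above) of approximate ensemble clustering
# over disjoint random sample blocks, in the spirit of the RSP-based framework.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def ensemble_cluster(X, k, n_blocks=8, seed=0):
    rng = np.random.default_rng(seed)
    # Shuffle once and split into disjoint "random sample" blocks,
    # mimicking RSP data blocks stored in a distributed file system.
    blocks = np.array_split(X[rng.permutation(len(X))], n_blocks)

    # Component clustering: each block is clustered independently
    # (in the distributed framework this step runs in parallel on worker nodes).
    component_centers = []
    for block in blocks:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(block)
        component_centers.append(km.cluster_centers_)
    pooled = np.vstack(component_centers)  # k component centers per block

    # Consensus step (simplified stand-in for the paper's two methods):
    # group the pooled centers into k groups and average each group.
    labels = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(pooled)
    final_centers = np.vstack([pooled[labels == j].mean(axis=0) for j in range(k)])

    # Final allocation: one k-means pass over the entire dataset,
    # initialized at the ensemble centers.
    final_km = KMeans(n_clusters=k, init=final_centers, n_init=1).fit(X)
    return final_km.labels_, final_km.cluster_centers_

if __name__ == "__main__":
    # Toy usage on synthetic Gaussian blobs.
    from sklearn.datasets import make_blobs
    X, _ = make_blobs(n_samples=20000, centers=5, random_state=0)
    labels, centers = ensemble_cluster(X, k=5)
    print(centers)
```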
Pages: 1142-1155 (14 pages)