Approximate Clustering Ensemble Method for Big Data

Cited by: 11
Authors
Mahmud, Mohammad Sultan [1 ,2 ]
Huang, Joshua Zhexue [1 ,2 ]
Ruby, Rukhsana [3 ]
Ngueilbaye, Alladoumbaye [1 ,2 ]
Wu, Kaishun [1 ,2 ]
Affiliations
[1] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
[2] Shenzhen Univ, Coll Comp Sci & Software Engn, Big Data Inst, Shenzhen 518060, Peoples R China
[3] Guangdong Lab Artificial Intelligence & Digital Econ, Shenzhen 518107, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Clustering approximation method; clustering ensemble; consensus functions; distributed clustering; RSP data model; K-MEANS; I-NICE; NUMBER; ALGORITHM; CONSENSUS; MODELS;
DOI
10.1109/TBDATA.2023.3255003
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Clustering a big distributed dataset of hundreds of gigabytes or more is a challenging task in distributed computing. A popular way to tackle this problem is to use a random sample of the big dataset to compute an approximate result as an estimation of the true result computed from the entire dataset. In this paper, instead of using a single random sample, we use multiple random samples to compute an ensemble result as the estimation of the true result of the big dataset. We propose a distributed computing framework to compute the ensemble result. In this framework, a big dataset is represented in the RSP data model as random sample data blocks managed in a distributed file system. To compute the ensemble clustering result, a set of RSP data blocks is randomly selected as random samples and clustered independently in parallel on the nodes of a computing cluster to generate the component clustering results. The component results are transferred to the master node, which computes the ensemble result. Since the random samples are disjoint, traditional consensus functions cannot be used, so we propose two new methods to integrate the component clustering results into the final ensemble result. The first method uses the component cluster centers to build a graph and the METIS algorithm to cut the graph into subgraphs, from which a set of candidate cluster centers is found; a hierarchical clustering method is then used to generate the final set of k cluster centers. The second method uses the clustering-by-passing-messages method to generate the final set of k cluster centers. Finally, the k-means algorithm is used to allocate the entire dataset into k clusters. Experiments were conducted on both synthetic and real-world datasets. The results show that the new ensemble clustering methods performed better than the comparison methods and that the distributed computing framework is efficient and scalable in clustering big datasets.
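The overall pipeline described in the abstract can be sketched compactly. The following Python example is a minimal, simplified illustration under stated assumptions, not the authors' implementation: RSP blocks are simulated by splitting a shuffled in-memory dataset, each block is clustered with k-means, and the component centers are merged with plain agglomerative clustering as a stand-in for the METIS graph-cut or message-passing consensus functions; all calls use NumPy and scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

k = 5            # target number of clusters
n_blocks = 8     # number of RSP-style random sample blocks selected

# Toy dataset standing in for a big distributed dataset.
X, _ = make_blobs(n_samples=40_000, centers=k, n_features=10, random_state=0)

# Step 1: split the shuffled data into disjoint blocks; each block plays the
# role of one RSP random sample data block.
rng = np.random.default_rng(0)
blocks = np.array_split(X[rng.permutation(len(X))], n_blocks)

# Step 2: cluster every block independently (in the framework this runs in
# parallel on worker nodes) and collect the component cluster centers.
component_centers = np.vstack([
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(b).cluster_centers_
    for b in blocks
])

# Step 3 (consensus on the master node): group the component centers and take
# each group's mean as a final cluster center. The paper's consensus functions
# (METIS graph cut + hierarchical clustering, or clustering by passing
# messages) are replaced here by plain agglomerative clustering for brevity.
labels = AgglomerativeClustering(n_clusters=k).fit_predict(component_centers)
final_centers = np.vstack([component_centers[labels == c].mean(axis=0)
                           for c in range(k)])

# Step 4: allocate the entire dataset into k clusters with k-means seeded by
# the ensemble centers.
final_model = KMeans(n_clusters=k, init=final_centers, n_init=1).fit(X)
print(final_model.cluster_centers_.shape)   # (k, n_features)
```

Merging only the component cluster centers keeps the consensus step cheap on the master node: its input size depends on the number of selected blocks and k, not on the size of the full dataset.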
Pages: 1142 - 1155
Page count: 14
Related Papers
50 records in total
  • [31] Mapreduce fuzzy c-means ensemble clustering with gentle adaboost for big data analytics
    Padmapriya, K.M.
    Anandhi, B.
    Vijayakumar, M.
    [J]. International Journal of Business Intelligence and Data Mining, 2021, 19 (02): 170 - 188
  • [32] Big Data clustering validity
    Tlili, Monia
    Hamdani, Tarek M.
    [J]. 2014 6TH INTERNATIONAL CONFERENCE OF SOFT COMPUTING AND PATTERN RECOGNITION (SOCPAR), 2014, : 348 - 352
  • [33] Big Data Needs Approximate Computing
    Nair, Ravi
    [J]. COMMUNICATIONS OF THE ACM, 2015, 58 (01): 104 - 104
  • [34] Approximate Computation for Big Data Analytics
    Ma, Shuai
    [J]. DATABASES THEORY AND APPLICATIONS, ADC 2018, 2018, 10837 : XVIII - XVIII
  • [35] Unsupervised extraction of greenhouses using approximate spectral clustering ensemble
    Pala, Esma
    Tasdemir, Kadim
    Koc-San, Dilek
    [J]. 2015 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS), 2015, : 4668 - 4671
  • [36] Approximate queries on big heterogeneous data
    Kantere, Verena
    [J]. 2015 IEEE INTERNATIONAL CONGRESS ON BIG DATA - BIGDATA CONGRESS 2015, 2015, : 712 - 715
  • [37] An Approximate Search Framework for Big Data
    Li, Shang
    Zhou, Zhigang
    Zhang, Hongli
    Fang, Binxing
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC), 2017,
  • [38] Incremental Clustering for Categorical Data Using Clustering Ensemble
    Li Taoying
    Chen Yan
    Qu Lili
    Mu Xiangwei
    [J]. PROCEEDINGS OF THE 29TH CHINESE CONTROL CONFERENCE, 2010, : 2519 - 2524
  • [39] Extraction of hazelnut fields using approximate spectral clustering ensemble
    Yalcin, Berna
    Moazzen, Yaser
    Tasdemir, Kadim
    [J]. 2015 23RD SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2015, : 640 - 643
  • [40] Sampling based approximate spectral clustering ensemble for partitioning datasets
    Moazzen, Yaser
    Tasdemir, Kadim
    [J]. 2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2016, : 1630 - 1635