Clustering Big Data Based on Distributed Fuzzy K-Medoids: An Application to Geospatial Informatics

Cited by: 5

Authors:

Madbouly, Magda M. [1]

Darwish, Saad M. [1]

Bagi, Noha A. [2]

Osman, Mohamed A. [3]

Affiliations:

[1] Alexandria Univ, Inst Grad Studies & Res, Dept Informat Technol, Alexandria 21526, Egypt

[2] Alexandria Water Co, Alexandria 21581, Egypt

[3] Higher Inst Management Informat Technol, Management Informat Syst Dept, Kafr Al Sheikh 33511, Egypt

Source:

IEEE ACCESS | 2022 / Vol. 10

Keywords:

Clustering algorithms; Big Data; Geospatial analysis; Heuristic algorithms; Partitioning algorithms; Distributed databases; Scalability; Geospatial informatics; big data clustering; dynamic clustering; Apache Spark; fuzzy K-medoids;

DOI:

10.1109/ACCESS.2022.3149548

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

The advent of big data related to spatial position, called geospatial big data, provides opportunities to understand the urban environment. Existing database processing methods cannot rapidly provide reliable results in a geospatial big data context, because approximation "measures" must be defined and query execution time keeps increasing. The clustering method yields the functional effects. However, scaling and accelerating clustering algorithms while maintaining high clustering efficiency remains a significant challenge. The paper's primary contribution is a modified hierarchical distributed k-medoid clustering method specific to spatial query analysis for big data. To improve the efficiency of the k-medoid algorithm and obtain more precise clusters, the suggested model uses the Fuzzy k-Medoids method to overcome outliers in the spatial data set and to deal with data uncertainty. The method is complex in nature since it does not presuppose the correct number of clusters. The proposed model is divided into two phases: the first phase creates local clusters, each based on a portion of the entire dataset, and makes extensive use of the parallelism provided by the Apache Spark framework; the second phase aggregates the local clusters to produce compact and reliable final clusters. The proposed model greatly reduces the amount of information shared during the aggregation process and automatically produces the appropriate number of clusters based on the dataset characteristics. The results show that the proposed model outperforms the traditional K-medoids in terms of the accuracy of the obtained centers in big data applications.
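The two-phase idea in the abstract (local fuzzy k-medoids on portions of the data, then aggregation of the local results) can be illustrated with a small, single-machine sketch. This is not the authors' implementation. Plain Python lists stand in for Spark partitions, the function names `farthest_first`, `fuzzy_k_medoids`, and `merge_medoids` are hypothetical, and the distance-threshold merge with its `merge_radius` parameter is an assumed aggregation rule, since the abstract does not specify one. The local step follows the usual fuzzy k-medoids pattern: compute memberships from inverse distances, then pick as each medoid the point that minimises the membership-weighted distance.

```python
import math
import random


def farthest_first(points, k):
    """Deterministic seeding (an assumption): start at points[0], then keep
    adding the point farthest from the medoids chosen so far."""
    chosen = [0]
    while len(chosen) < k:
        chosen.append(max(
            range(len(points)),
            key=lambda j: min(math.dist(points[j], points[c]) for c in chosen),
        ))
    return chosen


def memberships(points, medoids, m):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    rows = []
    for x in points:
        d = [math.dist(x, points[c]) for c in medoids]
        if min(d) == 0.0:
            # x coincides with a medoid: give it full membership there.
            w = [1.0 if di == 0.0 else 0.0 for di in d]
        else:
            w = [(1.0 / di) ** (1.0 / (m - 1.0)) for di in d]
        total = sum(w)
        rows.append([wi / total for wi in w])
    return rows


def fuzzy_k_medoids(points, k, m=2.0, max_iter=20):
    """Phase 1 (local): return k medoids for one partition of the data."""
    medoids = farthest_first(points, k)
    n = len(points)
    for _ in range(max_iter):
        u = memberships(points, medoids, m)
        # New medoid of cluster i: the point with the smallest
        # membership-weighted total distance to all points.
        updated = [
            min(range(n), key=lambda q: sum(
                (u[j][i] ** m) * math.dist(points[q], points[j])
                for j in range(n)))
            for i in range(k)
        ]
        if updated == medoids:
            break
        medoids = updated
    return [points[c] for c in medoids]


def merge_medoids(local_medoids, merge_radius):
    """Phase 2 (assumed rule): greedily group medoids lying within
    merge_radius of a group's first member, then keep one medoid per group."""
    groups = []
    for p in local_medoids:
        for g in groups:
            if math.dist(p, g[0]) <= merge_radius:
                g.append(p)
                break
        else:
            groups.append([p])
    return [min(g, key=lambda c: sum(math.dist(c, q) for q in g))
            for g in groups]


# Synthetic data: two well-separated square blobs of 100 points each.
rng = random.Random(0)
centers = [(0.0, 0.0), (10.0, 10.0)]
data = [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
        for cx, cy in centers for _ in range(100)]

# Four interleaved slices play the role of Spark partitions.
partitions = [data[p::4] for p in range(4)]
local_medoids = [med for part in partitions
                 for med in fuzzy_k_medoids(part, k=2)]
final_medoids = merge_medoids(local_medoids, merge_radius=3.0)
print(len(local_medoids), "local medoids ->", len(final_medoids), "final medoids")
```

Only the local medoids, not the raw points, reach the merge step, which mirrors the abstract's point about limiting what is shared during aggregation. The number of final clusters here comes from the merge rather than from a global `k`, but a fixed `merge_radius` is a simplification chosen for this sketch.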


Pages: 20926-20936

Page count: 11

