Scalable Parallel Algorithms for Shared Nearest Neighbor Clustering

Authors
Kumari, Sonal [1]
Maurya, Saurabh [1]
Goyal, Poonam [1]
Balasubramaniam, Sundar S. [1]
Goyal, Navneet [1]
Affiliations
[1] BITS Pilani, Dept Comp Sci & Informat Syst, Adv Data Analyt & Parallel Technol Lab, Pilani Campus, Pilani, Rajasthan, India
Keywords
Parallel algorithm; shared nearest neighbor; data mining; clustering; high-dimensional data; SNN
DOI
10.1109/HiPC.2016.16
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Clustering is a popular data mining technique that discovers structure in unlabeled data by grouping objects on the basis of a similarity criterion. Traditional similarity measures lose their meaning as the number of dimensions increases, and as a consequence distance- or density-based clustering algorithms become less meaningful. Shared Nearest Neighbor (SNN) clustering is a solution for high-dimensional data that can also find clusters of varying density. SNN assigns objects that share a large number of their nearest neighbors to the same cluster. However, SNN is compute- and memory-intensive for data of large size and/or dimensionality. Nearest neighbor queries account for a major proportion of the computation in SNN, so efficiency drops as the number of nearest neighbors (k) grows. The main motivation of this work is to improve the efficiency of SNN and to parallelize it so that it can be used to cluster large high-dimensional datasets and with large values of k, situations in which existing SNN algorithms become inefficient. In this paper, we present a new sequential SNN algorithm, R-SNN, which uses an R-tree to execute neighborhood queries efficiently and exploits spatial locality to minimize memory usage. R-SNN is benchmarked against the best available implementation of SNN and is found to be up to 77 times faster when tested on various real datasets. R-SNN is parallelized for distributed memory, shared memory, and hybrid systems. The significant speedup and scalability achieved can be attributed to parallelization, good load balancing strategies, and the exploitation of spatial locality. Experimental results demonstrate this for datasets of varying dimensionality and size. The maximum speedups achieved for the shared, distributed, and hybrid models are 427.19 using 48 threads, 394.24 using 32 processes, and 1380.69 on 32 nodes (with each node spawning 4 threads), respectively. Super-linear speedup for some datasets is attributed to optimized neighborhood queries. All the proposed algorithms produce the same clustering results as classical SNN.
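To make the idea of "sharing nearest neighbors" concrete, the following is a minimal sketch of the SNN similarity between two points. It is not the R-SNN algorithm from the paper. It assumes the commonly used formulation in which two points have non-zero similarity only if each appears in the other's k-nearest-neighbor list, and it uses a brute-force neighbor search where R-SNN would use an R-tree. The function names `knn_sets` and `snn_similarity` are hypothetical.

```python
from math import dist


def knn_sets(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors.

    Brute-force O(n^2 log n) search, used here only for illustration;
    a spatial index such as an R-tree would replace this step.
    """
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(neighbors, i, j):
    """Count the neighbors shared by i and j.

    Assumption: similarity is zero unless i and j are in each
    other's k-nearest-neighbor sets.
    """
    if i in neighbors[j] and j in neighbors[i]:
        return len(neighbors[i] & neighbors[j])
    return 0


# Two well-separated groups of 2-D points (made-up illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
neighbors = knn_sets(points, k=3)
print(snn_similarity(neighbors, 0, 1))  # pair within the first group
print(snn_similarity(neighbors, 0, 4))  # pair across the two groups
```

Every point needs one k-nearest-neighbor query in this sketch, which is consistent with the abstract's observation that these queries account for most of the computation and become more expensive as k grows.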
Pages: 72-81
Number of pages: 10