PACk: An Efficient Partition-based Distributed Agglomerative Hierarchical Clustering Algorithm for Deduplication

Cited by: 1
Authors
Wang, Yue [1]
Narasayya, Vivek [1]
He, Yeye [1]
Chaudhuri, Surajit [1]
Affiliation
[1] Microsoft Res, Redmond, WA 98052 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2022, Vol. 15, No. 6
Keywords
PARALLEL ALGORITHMS
DOI
10.14778/3514061.3514062
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in real-world applications. As data volumes continue to grow, efficient scale-out techniques for AHC are becoming increasingly important. In this paper, we propose a Partition-based distributed Agglomerative Hierarchical Clustering (PACk) algorithm using novel distance-based partitioning and distance-aware merging techniques. We have developed an efficient implementation of PACk on Spark. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2x to 19x (median=9x) speedup across a variety of synthetic and real-world datasets.
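For readers unfamiliar with AHC, the following is a minimal single-machine sketch of the baseline procedure that the abstract refers to. It is not PACk: it contains none of the paper's partitioning or merging techniques and does not use Spark. The complete-linkage rule, the `threshold` stopping criterion, the function names, and the sample points are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import dist


def complete_linkage(a, b):
    """Distance between two clusters: the largest pairwise point distance."""
    return max(dist(p, q) for p in a for q in b)


def ahc(points, threshold):
    """Naive agglomerative clustering (illustrative only).

    Repeatedly merge the closest pair of clusters until no pair is within
    `threshold`, a hypothetical cutoff standing in for a deduplication rule.
    """
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: complete_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if complete_linkage(clusters[i], clusters[j]) > threshold:
            break  # remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Made-up sample records represented as 2-D points.
records = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (9.0, 0.0)]
groups = ahc(records, threshold=1.0)
```

Each merge step in this naive form rescans all cluster pairs on one machine, which becomes expensive as the number of records grows; that cost is the kind of bottleneck a scale-out approach is meant to address.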
Pages: 1132-1145
Number of pages: 14
Related papers
50 in total
  • [21] AHSCAN: Agglomerative Hierarchical Structural Clustering Algorithm for Networks
    Yuruk, Nurcan
    Mete, Mutlu
    Xu, Xiaowei
    Schweiger, Thomas A. J.
    2009 INTERNATIONAL CONFERENCE ON ADVANCES IN SOCIAL NETWORKS ANALYSIS AND MINING, 2009, : 72 - +
  • [23] EFFICIENT ALGORITHMS FOR AGGLOMERATIVE HIERARCHICAL-CLUSTERING METHODS
    DAY, WHE
    EDELSBRUNNER, H
    JOURNAL OF CLASSIFICATION, 1984, 1 (01) : 7 - 24
  • [24] Efficient Agglomerative Hierarchical Clustering for Biological Sequence Analysis
    Thuy-Diem Nguyen
    Kwoh, Chee-Keong
    TENCON 2015 - 2015 IEEE REGION 10 CONFERENCE, 2015,
  • [25] A Degenerate Agglomerative Hierarchical Clustering Algorithm for Community Detection
    Fiscarelli, Antonio Maria
    Beliakov, Aleksandr
    Konchenko, Stanislav
    Bouvry, Pascal
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2018, PT I, 2018, 10751 : 234 - 242
  • [26] An agglomerative hierarchical clustering algorithm for linear ordinal rankings
    Liu, Nana
    Xu, Zeshui
    Zeng, Xiao-Jun
    Ren, Peijia
    INFORMATION SCIENCES, 2021, 557 : 170 - 193
  • [27] Anomaly Detection Using Agglomerative Hierarchical Clustering Algorithm
    Mazarbhuiya, Fokrul Alom
    AlZahrani, Mohammed Y.
    Georgieva, Lilia
    INFORMATION SCIENCE AND APPLICATIONS 2018, ICISA 2018, 2019, 514 : 475 - 484
  • [28] Partition-based parallel PageRank algorithm
    Rungsawang, A
    Manaskasemsak, B
    Third International Conference on Information Technology and Applications, Vol 2, Proceedings, 2005, : 57 - 62
  • [29] Partition-based mass clustering of tractography streamlines
    Visser, Eelke
    Nijhuis, Emil H. J.
    Buitelaar, Jan K.
    Zwiers, Marcel P.
    NEUROIMAGE, 2011, 54 (01) : 303 - 312
  • [30] An Automated Clustering Algorithm Based On Agglomerative Clustering
    Karabina, Armagan
    Kilic, Erdal
    2016 24TH SIGNAL PROCESSING AND COMMUNICATION APPLICATION CONFERENCE (SIU), 2016, : 1801 - 1804