PACk: An Efficient Partition-based Distributed Agglomerative Hierarchical Clustering Algorithm for Deduplication

Cited by: 1
Authors
Wang, Yue [1]
Narasayya, Vivek [1]
He, Yeye [1]
Chaudhuri, Surajit [1]
Affiliation
[1] Microsoft Res, Redmond, WA 98052 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2022 / Volume 15 / Issue 06
Keywords
PARALLEL ALGORITHMS;
DOI
10.14778/3514061.3514062
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in real-world applications. As data volumes continue to grow, efficient scale-out techniques for AHC are becoming increasingly important. In this paper, we propose a Partition-based distributed Agglomerative Hierarchical Clustering (PACk) algorithm using novel distance-based partitioning and distance-aware merging techniques. We have developed an efficient implementation of PACk on Spark. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2x to 19x (median=9x) speedup across a variety of synthetic and real-world datasets.
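For orientation, the baseline that the abstract refers to can be sketched as follows. This is a minimal single-machine AHC with a distance threshold as the stopping rule, which is one common way to use AHC for deduplication. It is not the PACk algorithm and does not reproduce its distance-based partitioning or distance-aware merging, which the abstract does not detail. The function name `naive_ahc`, the complete-linkage default, the Euclidean distance, and the sample points are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def naive_ahc(points, threshold, linkage=max):
    """Naive agglomerative clustering (illustrative, not PACk).

    Starts with one cluster per point and repeatedly merges the two
    closest clusters until the closest pair is farther apart than
    `threshold`. `linkage=max` gives complete linkage; `min` would
    give single linkage. Returns clusters as lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Linkage over all cross-cluster point pairs (Euclidean).
        return linkage(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters (quadratic scan per step).
        (x, y), d = min(
            (
                ((x, y), cluster_distance(clusters[x], clusters[y]))
                for x, y in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if d > threshold:
            break  # no remaining pair is close enough to merge
        clusters[x].extend(clusters[y])  # x < y, so index x stays valid
        del clusters[y]

    return [sorted(c) for c in clusters]


# Hypothetical records: two near-duplicate pairs and one isolated point.
points = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 10)]
groups = naive_ahc(points, threshold=1.0)
print(groups)
```

Each merge step in this sketch rescans every pair of clusters, so the cost grows quickly with the number of records. That cost is the kind of bottleneck that motivates scale-out approaches such as the one the abstract proposes.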
Pages: 1132-1145
Page count: 14
Related papers
50 in total
  • [41] Divisive hierarchical clustering algorithm based on soft hyperspheric partition
    School of Information Technology, Jiangnan University, Wuxi 214122, China
    Unknown
    Moshi Shibie yu Rengong Zhineng, 2008, 4 (559-568):
  • [42] Spatial clustering algorithm based on hierarchical-partition tree
    Li Z.
    Wang X.
    International Journal of Digital Content Technology and its Applications, 2010, 4 (06) : 26 - 35
  • [43] A different quantity of partition-based efficient algorithm for reduction of attribute in information systems
    Li, Jin-hai
    Lv, Yue-jin
    Liu, Nan-xing
    FOURTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 3, PROCEEDINGS, 2007, : 74 - 78
  • [44] The application of agglomerative hierarchical spatial clustering algorithm in tea blending
    Tie, Jun
    Chen, Wenying
    Sun, Chong
    Mao, Tengyue
    Xing, Guanglin
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2019, 22 (Suppl 3) : S6059 - S6068
  • [45] The application of agglomerative hierarchical spatial clustering algorithm in tea blending
    Jun Tie
    Wenying Chen
    Chong Sun
    Tengyue Mao
    Guanglin Xing
    Cluster Computing, 2019, 22 : 6059 - 6068
  • [46] A new agglomerative 2-3 Hierarchical Clustering algorithm
    Chelcea, S
    Bertrand, P
    Trousse, B
    INNOVATIONS IN CLASSIFICATION, DATA SCIENCE, AND INFORMATION SYSTEMS, 2005, : 3 - 10
  • [47] Equal Area Partition-Based Energy Efficient Routing Algorithm for Circular WSN
    Hu Liqin
    Wang Sanyou
    Ma Fujun
    Zhang Shubo
    WIRELESS COMMUNICATIONS & MOBILE COMPUTING, 2021, 2021
  • [48] Network partition-based hierarchical decentralised voltage control for distribution networks with distributed PV systems
    Luo, Chen
    Wu, Hongbin
    Zhou, Yiyao
    Qiao, Yida
    Cai, Mengyi
    INTERNATIONAL JOURNAL OF ELECTRICAL POWER & ENERGY SYSTEMS, 2021, 130 (130)
  • [49] Partition-Based Clustering with Sliding Windows for Data Streams
    Youn, Jonghem
    Choi, Jihun
    Shim, Junho
    Lee, Sang-goo
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS (DASFAA 2017), PT II, 2017, 10178 : 289 - 303
  • [50] AN EFFICIENT AGGLOMERATIVE CLUSTERING-ALGORITHM USING A HEAP
    KURITA, T
    PATTERN RECOGNITION, 1991, 24 (03) : 205 - 209