PACk: An Efficient Partition-based Distributed Agglomerative Hierarchical Clustering Algorithm for Deduplication

Cited by: 1
Authors
Wang, Yue [1]
Narasayya, Vivek [1]
He, Yeye [1]
Chaudhuri, Surajit [1]
Affiliations
[1] Microsoft Res, Redmond, WA 98052 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2022, Vol. 15, No. 6
Keywords
PARALLEL ALGORITHMS;
DOI
10.14778/3514061.3514062
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in real-world applications. As data volumes continue to grow, efficient scale-out techniques for AHC are becoming increasingly important. In this paper, we propose a Partition-based distributed Agglomerative Hierarchical Clustering (PACk) algorithm using novel distance-based partitioning and distance-aware merging techniques. We have developed an efficient implementation of PACk on Spark. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2x to 19x (median=9x) speedup across a variety of synthetic and real-world datasets.
Pages: 1132-1145
Page count: 14
Related papers
50 in total
  • [31] A partition-based global optimization algorithm
    Giampaolo Liuzzi
    Stefano Lucidi
    Veronica Piccialli
    Journal of Global Optimization, 2010, 48 : 113 - 128
  • [32] A partition-based global optimization algorithm
    Liuzzi, Giampaolo
    Lucidi, Stefano
    Piccialli, Veronica
    JOURNAL OF GLOBAL OPTIMIZATION, 2010, 48 (01) : 113 - 128
  • [33] Rapid Prototyping of Hierarchical Agglomerative Clustering Algorithms for Distributed Systems
    Islam, Saiyedul
    Goyal, Navneet
    Balasubramaniam, Sundar
    Goyal, Poonam
    Agarwal, Achal
    Rathore, Kirti Singh
    Singh, Nischay
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2019, : 307 - 316
  • [34] An automatic partition-based parallel algorithm for grid-based distributed hydrological models
    Xu, Zhenwu
    Tang, Guoping
    Jiang, Tao
    Chen, Xiaohua
    Chen, Tao
    Niu, Xiangyu
    ENVIRONMENTAL MODELLING & SOFTWARE, 2021, 144
  • [35] Multiviewpoint-Based Agglomerative Hierarchical Clustering
    Fujiwara, Yuji
    Koga, Hisashi
    DATABASE AND EXPERT SYSTEMS APPLICATIONS, PT II, 2019, 11707 : 325 - 340
  • [36] An ensemble agglomerative hierarchical clustering algorithm based on clusters clustering technique and the novel similarity measurement
    Li, Teng
    Rezaeipanah, Amin
    El Din, ElSayed M. Tag
    JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2022, 34 (06) : 3828 - 3842
  • [37] Cavitation Diagnosis Method for Centrifugal Pumps based on Agglomerative Hierarchical Clustering Algorithm
    Huang H.M.
    Liu Y.
    Wu D.H.
    Wu Y.Z.
    Wu T.X.
    International Journal of Fluid Machinery and Systems, 2023, 16 (01) : 89 - 97
  • [38] Research on Optimal Design of Civil Sensors Based on Agglomerative Hierarchical Clustering Algorithm
    Cheng, Xingyan
    Zhu, Linyan
    Cheng, Yimei
    TEHNICKI VJESNIK-TECHNICAL GAZETTE, 2024, 31 (05) : 1455 - 1463
  • [39] Intelligent Logistics Supplier Selection Based On Improved Agglomerative Hierarchical Clustering Algorithm
    Zhang, Yajie
    Lv, Yaqiong
    Tu, Lei
    Hou, Yueqiu
    2019 IEEE 17TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN), 2019, : 1309 - 1314
  • [40] A new agglomerative hierarchical clustering algorithm implementation based on the map reduce framework
    Gao H.
    Jiang J.
    She L.
    Fu Y.
    International Journal of Digital Content Technology and its Applications, 2010, 4 (03) : 95 - 100