PACk: An Efficient Partition-based Distributed Agglomerative Hierarchical Clustering Algorithm for Deduplication

Cited by: 1
Authors
Wang, Yue [1]
Narasayya, Vivek [1]
He, Yeye [1]
Chaudhuri, Surajit [1]
Affiliation
[1] Microsoft Res, Redmond, WA 98052 USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2022 / Vol. 15 / Issue 06
Keywords
PARALLEL ALGORITHMS;
DOI
10.14778/3514061.3514062
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The Agglomerative Hierarchical Clustering (AHC) algorithm is widely used in real-world applications. As data volumes continue to grow, efficient scale-out techniques for AHC are becoming increasingly important. In this paper, we propose a Partition-based distributed Agglomerative Hierarchical Clustering (PACk) algorithm using novel distance-based partitioning and distance-aware merging techniques. We have developed an efficient implementation of PACk on Spark. Compared to the state-of-the-art distributed AHC algorithm, PACk achieves 2x to 19x (median=9x) speedup across a variety of synthetic and real-world datasets.
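For orientation, the single-machine AHC procedure that the abstract refers to can be sketched as below. This is not the PACk algorithm and does not show its partitioning or merging techniques, which the abstract does not detail. The function names `ahc` and `average_linkage`, the choice of average linkage, the Euclidean distance, and the threshold-based stopping rule are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def average_linkage(a, b, points):
    """Mean pairwise Euclidean distance between two clusters of point indices."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def ahc(points, threshold):
    """Naive agglomerative clustering (illustrative, not PACk).

    Starts with one cluster per point and repeatedly merges the closest
    pair of clusters until the closest pair is farther apart than `threshold`.
    Returns clusters as sorted lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        (x, y), d = min(
            (
                ((x, y), average_linkage(clusters[x], clusters[y], points))
                for x, y in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if d > threshold:
            break
        # x < y always holds here, so deleting y does not shift index x.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return sorted(sorted(c) for c in clusters)


# Hypothetical records embedded as 2-D points; near-duplicates sit close together.
records = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
groups = ahc(records, threshold=2.0)
```

Every merge step in this naive form rescans all cluster pairs, which is the kind of cost that motivates scale-out approaches for large inputs.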
Pages: 1132 - 1145
Page count: 14
Related papers
50 in total
  • [1] A Graph Partition-based Soft Clustering Algorithm
    Chen Jianbin
    Fang Deying
    Shi Tong
    2008 INTERNATIONAL SYMPOSIUM ON INTELLIGENT INFORMATION TECHNOLOGY APPLICATION, VOL II, PROCEEDINGS, 2008, : 572 - 577
  • [2] Hierarchical Agglomerative Clustering Algorithm method for distributed generation planning
    Vinothkumar, K.
    Selvan, M. P.
    INTERNATIONAL JOURNAL OF ELECTRICAL POWER & ENERGY SYSTEMS, 2014, 56 : 259 - 269
  • [3] Efficient agglomerative hierarchical clustering
    Bouguettaya, Athman
    Yu, Qi
    Liu, Xumin
    Zhou, Xiangmin
    Song, Andy
    EXPERT SYSTEMS WITH APPLICATIONS, 2015, 42 (05) : 2785 - 2797
  • [4] Development of an efficient hierarchical clustering analysis using an agglomerative clustering algorithm
    Naeem, Arshia
    Rehman, Mariam
    Anjum, Maria
    Asif, Muhammad
    CURRENT SCIENCE, 2019, 117 (06) : 1045 - 1053
  • [5] An efficient partition-based parallel PageRank algorithm
    Manaskasemsak, B
    Rungsawang, A
    11th International Conference on Parallel and Distributed Systems, Vol I, Proceedings, 2005, : 257 - 263
  • [6] An efficient interactive agglomerative hierarchical clustering algorithm for hyperspectral image processing
    Rahman, SA
    IMAGING SPECTROMETRY IV, 1998, 3438 : 210 - 221
  • [7] An agglomerative hierarchical clustering algorithm based on global distance measurement
    Liu, Fang
    Wei, Yongqing
    Ren, Min
    Hou, Xiuyan
    Liu, Yingying
    2015 7TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY IN MEDICINE AND EDUCATION (ITME), 2015, : 363 - 367
  • [8] Overlapping Community Discovery Algorithm Based on Hierarchical Agglomerative Clustering
    Liu, Hongtao
    Fen, Linghu
    Jian, Jie
    Chen, Long
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2018, 32 (03)
  • [9] An incremental document clustering algorithm based on a hierarchical agglomerative approach
    Joo, KH
    Lee, SJ
    DISTRIBUTED COMPUTING AND INTERNET TECHNOLOGY, PROCEEDINGS, 2005, 3816 : 321 - 332
  • [10] Agglomerative hierarchical clustering based algorithm for network topology inference
    Zhang, Run-Sheng
    Li, Yan-Bin
    Li, Xiao-Tian
    Zhang, R.-S. (zhang_runsheng@163.com), 2013, Chinese Institute of Electronics (41) : 2346 - 2352