Scalable Hierarchical Agglomerative Clustering

Cited by: 20
Authors
Monath, Nicholas [2, 5]
Dubey, Kumar Avinava [1]
Guruganesh, Guru [1]
Zaheer, Manzil [1]
Ahmed, Amr [1]
McCallum, Andrew [2]
Mergen, Gokhan [1]
Najork, Marc [1]
Terzihan, Mert [4, 5]
Tjanaka, Bryon [3, 5]
Wang, Yuan [1]
Wu, Yuchen [1]
Affiliations
[1] Google LLC, Mountain View, CA 94043 USA
[2] Univ Massachusetts, Amherst, MA 01003 USA
[3] Univ Southern Calif, Los Angeles, CA 90007 USA
[4] Facebook, Menlo Pk, CA USA
[5] Google, Mountain View, CA 94043 USA
Funding
U.S. National Science Foundation;
Keywords
Clustering; Hierarchical Clustering;
DOI
10.1145/3447548.3467404
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The applicability of agglomerative clustering, for inferring both hierarchical and flat clusterings, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition but also provide a two-approximation to the non-parametric DP-Means objective [32]. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method's scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters shows that our method finds higher-quality clusters than the current state-of-the-art.
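For readers unfamiliar with the DP-Means objective named in the abstract, the following is a minimal sketch of how that cost is commonly written: the sum of squared distances from each point to its cluster centroid, plus a penalty `lam` per cluster. The function name `dp_means_cost`, the parameter `lam`, and the toy data are illustrative assumptions and are not taken from the paper; the sketch only evaluates the cost of a given flat partition and does not implement the paper's algorithm.

```python
from collections import defaultdict

def dp_means_cost(points, labels, lam):
    """Cost of a flat partition under the usual DP-Means formulation (assumed):
    within-cluster squared Euclidean distance to the centroid,
    plus lam times the number of clusters."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    cost = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid is the coordinate-wise mean of the cluster's points.
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        cost += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return cost + lam * len(clusters)

# Toy data (hypothetical): two well-separated pairs of points.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
two_clusters = dp_means_cost(points, [0, 0, 1, 1], lam=1.0)
one_cluster = dp_means_cost(points, [0, 0, 0, 0], lam=1.0)
```

On this toy data the two-cluster partition has the lower cost, since the per-cluster penalty is small relative to the separation. A larger `lam` would eventually favor merging everything into one cluster.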
Pages: 1245-1255
Page count: 11
Related papers
50 items in total
  • [31] Application of Agglomerative Hierarchical Clustering for Clustering of Time Series Data
    Radovanovic, Ana
    Li, Junshi
    Milanovic, Jovica, V
    Milosavljevic, Nina
    Storchi, Riccardo
    2020 IEEE PES INNOVATIVE SMART GRID TECHNOLOGIES EUROPE (ISGT-EUROPE 2020): SMART GRIDS: KEY ENABLERS OF A GREEN POWER SYSTEM, 2020, : 640 - 644
  • [32] Semantic Clustering of Functional Requirements Using Agglomerative Hierarchical Clustering
    Salman, Hamzeh Eyal
    Hammad, Mustafa
    Seriai, Abdelhak-Djamel
    Al-Sbou, Ahed
    INFORMATION, 2018, 9 (09)
  • [33] Energy-efficient scalable routing algorithm based on hierarchical agglomerative clustering for Wireless Sensor Networks
    Chai, Xuguang
    Wu, Yalin
    Feng, Lei
    ALEXANDRIA ENGINEERING JOURNAL, 2025, 120 : 95 - 105
  • [34] Development of an efficient hierarchical clustering analysis using an agglomerative clustering algorithm
    Naeem, Arshia
    Rehman, Mariam
    Anjum, Maria
    Asif, Muhammad
    CURRENT SCIENCE, 2019, 117 (06): : 1045 - 1053
  • [35] Hierarchical Agglomerative Clustering of Time-Warped Series
    Kotas, Marian
    Leski, Jacek
    Moron, Tomasz
    Guzman, Jader Giraldo
    MAN-MACHINE INTERACTIONS 5, ICMMI 2017, 2018, 659 : 207 - 216
  • [36] Online Agglomerative Hierarchical Clustering of Neural Fiber Tracts
    Demir, Ali
    Mohamed, Ashraf
    Cetingul, H. Ertan
    2013 35TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY (EMBC), 2013, : 85 - 88
  • [37] A Secure Distributed Framework for Agglomerative Hierarchical Clustering Construction
    Hamidi, Mona
    Sheikhalishahi, Mina
    Martinelli, Fabio
    2018 26TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP 2018), 2018, : 430 - 435
  • [38] AHSCAN: Agglomerative Hierarchical Structural Clustering Algorithm for Networks
    Yuruk, Nurcan
    Mete, Mutlu
    Xu, Xiaowei
    Schweiger, Thomas A. J.
    2009 INTERNATIONAL CONFERENCE ON ADVANCES IN SOCIAL NETWORKS ANALYSIS AND MINING, 2009, : 72 - +
  • [39] An agglomerative hierarchical approach to visualization in Bayesian clustering problems
    Dawson, K. J.
    Belkhir, K.
    HEREDITY, 2009, 103 (01) : 32 - 45
  • [40] Defining Hydrogeological Site Similarity with Hierarchical Agglomerative Clustering
    Kawa, Nura
    Cucchi, Karina
    Rubin, Yoram
    Attinger, Sabine
    Hesse, Falk
    GROUNDWATER, 2023, 61 (04) : 563 - 573