Scalable Hierarchical Agglomerative Clustering

Cited by: 20
Authors
Monath, Nicholas [2, 5]
Dubey, Kumar Avinava [1]
Guruganesh, Guru [1]
Zaheer, Manzil [1]
Ahmed, Amr [1]
McCallum, Andrew [2]
Mergen, Gokhan [1]
Najork, Marc [1]
Terzihan, Mert [4, 5]
Tjanaka, Bryon [3, 5]
Wang, Yuan [1]
Wu, Yuchen [1]
Affiliations
[1] Google LLC, Mountain View, CA 94043 USA
[2] Univ Massachusetts, Amherst, MA 01003 USA
[3] Univ Southern Calif, Los Angeles, CA 90007 USA
[4] Facebook, Menlo Pk, CA USA
[5] Google, Mountain View, CA 94043 USA
Funding
National Science Foundation (USA);
Keywords
Clustering; Hierarchical Clustering;
DOI
10.1145/3447548.3467404
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
The applicability of agglomerative clustering, for inferring both hierarchical and flat clusterings, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition but also provide a two-approximation to the non-parametric DP-Means objective [32]. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method's scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters shows that our method finds higher-quality clusters than the current state-of-the-art.
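For readers unfamiliar with the DP-Means objective mentioned in the abstract, the following is a minimal sketch of how it is commonly stated: the within-cluster sum of squared distances to each cluster mean, plus a penalty of lambda per cluster. This record does not spell out the exact form used in the paper, so the formula, the function name `dp_means_cost`, and the parameter `lam` are assumptions made for illustration only.

```python
import numpy as np


def dp_means_cost(points, labels, lam):
    """Commonly stated DP-Means cost (assumed form, not taken from this record):
    sum of squared distances to each cluster mean, plus lam per cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    cost = 0.0
    for c in cluster_ids:
        members = points[labels == c]
        center = members.mean(axis=0)  # cluster mean minimizes squared distance
        cost += float(((members - center) ** 2).sum())
    # The penalty term discourages using many clusters.
    return cost + lam * len(cluster_ids)


# Hypothetical toy data: two well-separated pairs of points.
toy_points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
two_clusters = dp_means_cost(toy_points, [0, 0, 1, 1], lam=1.0)
one_cluster = dp_means_cost(toy_points, [0, 0, 0, 0], lam=1.0)
```

On this toy data the two-cluster partition has a much lower cost than the single-cluster one, which is the sense in which the penalty trades cluster count against within-cluster spread.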
Pages: 1245 - 1255
Number of pages: 11
Related papers
50 in total
  • [1] HIERARCHICAL AGGLOMERATIVE CLUSTERING PROCEDURE
    LUKASOVA, A
    PATTERN RECOGNITION, 1979, 11 (5-6) : 365 - 381
  • [2] Efficient agglomerative hierarchical clustering
    Bouguettaya, Athman
    Yu, Qi
    Liu, Xumin
    Zhou, Xiangmin
    Song, Andy
    EXPERT SYSTEMS WITH APPLICATIONS, 2015, 42 (05) : 2785 - 2797
  • [3] Agglomerative hierarchical clustering for data with tolerance
    Yasunori, Endo
    Yukihiro, Hamasuna
    Sadaaki, Miyamoto
    GRC: 2007 IEEE INTERNATIONAL CONFERENCE ON GRANULAR COMPUTING, PROCEEDINGS, 2007, : 404 - 409
  • [4] Hierarchical subtrees agglomerative clustering algorithms
    Beijing Municipal Key Laboratory of Multimedia and Intelligent Software Technology, College of Computer Science and Technology, Beijing University of Technology, Beijing 100022, China
    Beijing Gongye Daxue Xuebao J. Beijing Univ. Technol., 2006, 5 (442-446):
  • [5] Fair Algorithms for Hierarchical Agglomerative Clustering
    Chhabra, Anshuman
    Mohapatra, Prasant
    2022 21ST IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS, ICMLA, 2022, : 206 - 211
  • [6] Agglomerative and divisive hierarchical Bayesian clustering
    Burghardt, Elliot
    Sewell, Daniel
    Cavanaugh, Joseph
    COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2022, 176
  • [7] Geometric algorithms for agglomerative hierarchical clustering
    Chen, DZ
    Xu, B
    COMPUTING AND COMBINATORICS, PROCEEDINGS, 2003, 2697 : 30 - 39
  • [8] Order preserving hierarchical agglomerative clustering
    Bakkelund, Daniel
    MACHINE LEARNING, 2022, 111 (05) : 1851 - 1901
  • [9] Hierarchical Agglomerative Clustering with Ordering Constraints
    Zhao, Haifeng
    Qi, ZiJie
    THIRD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING: WKDD 2010, PROCEEDINGS, 2010, : 195 - 199
  • [10] Learning the threshold in hierarchical agglomerative clustering
    Daniels, Kristine
    Giraud-Carrier, Christophe
    ICMLA 2006: 5TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS, PROCEEDINGS, 2006, : 270 - +