Scalable Hierarchical Agglomerative Clustering

Cited by: 20
Authors
Monath, Nicholas [2, 5]
Dubey, Kumar Avinava [1]
Guruganesh, Guru [1]
Zaheer, Manzil [1]
Ahmed, Amr [1]
McCallum, Andrew [2]
Mergen, Gokhan [1]
Najork, Marc [1]
Terzihan, Mert [4, 5]
Tjanaka, Bryon [3, 5]
Wang, Yuan [1]
Wu, Yuchen [1]
Affiliations
[1] Google LLC, Mountain View, CA 94043 USA
[2] Univ Massachusetts, Amherst, MA 01003 USA
[3] Univ Southern Calif, Los Angeles, CA 90007 USA
[4] Facebook, Menlo Pk, CA USA
[5] Google, Mountain View, CA 94043 USA
Funding
National Science Foundation (USA);
Keywords
Clustering; Hierarchical Clustering;
DOI
10.1145/3447548.3467404
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
The applicability of agglomerative clustering, for inferring both hierarchical and flat clusterings, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition but also provide a two-approximation to the non-parametric DP-Means objective [32]. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method's scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters shows that our method finds higher-quality clusters than the current state of the art.
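For orientation, the DP-Means objective named in the abstract is commonly written as the k-means cost (sum of squared distances from each point to its cluster mean) plus a penalty λ for every cluster used. The following is a minimal sketch of that objective only, not the paper's algorithm; the function name `dp_means_cost`, the parameter `lam`, and the toy points are illustrative assumptions.

```python
from collections import defaultdict


def dp_means_cost(points, labels, lam):
    """Squared distance of each point to its cluster mean, plus lam per cluster.

    Illustrative helper (hypothetical name); not taken from the paper.
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    cost = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Cluster mean, computed coordinate by coordinate.
        mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        cost += sum((p[d] - mean[d]) ** 2 for p in members for d in range(dim))

    # The per-cluster penalty is what makes the number of clusters a free choice.
    return cost + lam * len(clusters)


# Toy data (assumed): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_clusters = dp_means_cost(points, [0, 0, 1, 1], lam=5.0)
one_cluster = dp_means_cost(points, [0, 0, 0, 0], lam=5.0)
print(two_clusters, one_cluster)
```

With this penalty, splitting the toy data into two clusters costs less than merging everything into one, which is the trade-off a flat partition cut from a hierarchy would be scored on.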
Pages: 1245-1255
Page count: 11
Related papers
50 in total
  • [21] Resolving the structure of interactomes with hierarchical agglomerative clustering
    Park, Yongjin
    Bader, Joel S.
    BMC BIOINFORMATICS, 2011, 12
  • [22] Asymmetric agglomerative hierarchical clustering algorithms and their evaluations
    Takeuchi, Akinobu
    Saito, Takayuki
    Yadohisa, Hiroshi
    JOURNAL OF CLASSIFICATION, 2007, 24 (01) : 123 - 143
  • [23] Empirical Comparison of Distances for Agglomerative Hierarchical Clustering
    Tsumoto, Shusaku
    Kimura, Tomohiro
    Iwata, Haruko
    Hirano, Shoji
    INFORMATION PROCESSING AND MANAGEMENT OF UNCERTAINTY IN KNOWLEDGE-BASED SYSTEMS: THEORY AND FOUNDATIONS, PT II, 2018, 854 : 538 - 548
  • [24] Asymmetric Agglomerative Hierarchical Clustering Algorithms and Their Evaluations
    Akinobu Takeuchi
    Takayuki Saito
    Hiroshi Yadohisa
    Journal of Classification, 2007, 24 : 123 - 143
  • [25] Empirical Comparison of Similarities for Agglomerative Hierarchical Clustering
    Tsumoto, Shusaku
    Hirano, Shoji
    Kimura, Tomohiro
    Iwata, Haruko
    2018 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2018, : 3405 - 3410
  • [26] Customer Segmentation Using Hierarchical Agglomerative Clustering
    Phan Duy Hung
    Nguyen Thi Thuy Lien
    Nguyen Duc Ngoc
    PROCEEDINGS OF THE 2ND INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND SYSTEMS (ICISS 2019), 2019, : 33 - 37
  • [27] Competence maps using agglomerative hierarchical clustering
    Barirani, Ahmad
    Agard, Bruno
    Beaudry, Catherine
    JOURNAL OF INTELLIGENT MANUFACTURING, 2013, 24 (02) : 373 - 384
  • [28] Hesitant fuzzy agglomerative hierarchical clustering algorithms
    Zhang, Xiaolu
    Xu, Zeshui
    INTERNATIONAL JOURNAL OF SYSTEMS SCIENCE, 2015, 46 (03) : 562 - 576
  • [29] Agglomerative hierarchical clustering for nonlinear data analysis
    Wattanachon, U
    Lursinsap, C
    2004 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN & CYBERNETICS, VOLS 1-7, 2004, : 1420 - 1425
  • [30] Parallel Hierarchical Agglomerative Clustering for fMRI Data
    Angeletti, Melodie
    Bonny, Jean-Marie
    Durif, Franck
    Koko, Jonas
    PARALLEL PROCESSING AND APPLIED MATHEMATICS (PPAM 2017), PT I, 2018, 10777 : 265 - 275