Scalable Hierarchical Agglomerative Clustering

Cited by: 20
Authors
Monath, Nicholas [2, 5]
Dubey, Kumar Avinava [1]
Guruganesh, Guru [1]
Zaheer, Manzil [1]
Ahmed, Amr [1]
McCallum, Andrew [2]
Mergen, Gokhan [1]
Najork, Marc [1]
Terzihan, Mert [4, 5]
Tjanaka, Bryon [3, 5]
Wang, Yuan [1]
Wu, Yuchen [1]
Affiliations
[1] Google LLC, Mountain View, CA 94043 USA
[2] Univ Massachusetts, Amherst, MA 01003 USA
[3] Univ Southern Calif, Los Angeles, CA 90007 USA
[4] Facebook, Menlo Pk, CA USA
[5] Google, Mountain View, CA 94043 USA
Funding
U.S. National Science Foundation
Keywords
Clustering; Hierarchical Clustering
DOI
10.1145/3447548.3467404
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The applicability of agglomerative clustering, for inferring both hierarchical and flat clusterings, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition but also provide a two-approximation to the non-parametric DP-Means objective [32]. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method's scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters shows that our method finds higher-quality clusters than the current state-of-the-art.
Pages: 1245-1255
Page count: 11
Related papers
50 in total
  • [41] Agglomerative hierarchical clustering with constraints: Theoretical and empirical results
    Davidson, I
    Ravi, SS
    KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2005, 2005, 3721 : 59 - 70
  • [42] Discovering Program Topoi via Hierarchical Agglomerative Clustering
    Ieva, Carlo
    Gotlieb, Arnaud
    Kaci, Souhila
    Lazaar, Nadjib
    IEEE TRANSACTIONS ON RELIABILITY, 2018, 67 (03) : 758 - 770
  • [44] Rough Entropy Hierarchical Agglomerative Clustering in Image Segmentation
    Malyszko, Dariusz
    Stepaniuk, Jaroslaw
    TRANSACTIONS ON ROUGH SETS XIII, 2011, 6499 : 89 - 103
  • [45] Model Order Reduction Based on Agglomerative Hierarchical Clustering
    Al-Dabooni, Seaar
    Wunsch, Donald
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2019, 30 (06) : 1881 - 1895
  • [46] Application of Online Agglomerative Hierarchical Clustering on real dMRI
    Demir, Ali
    Ozkan, Mehmed
    2014 18TH NATIONAL BIOMEDICAL ENGINEERING MEETING (BIYOMUT), 2014,
  • [47] A Degenerate Agglomerative Hierarchical Clustering Algorithm for Community Detection
    Fiscarelli, Antonio Maria
    Beliakov, Aleksandr
    Konchenko, Stanislav
    Bouvry, Pascal
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2018, PT I, 2018, 10751 : 234 - 242
  • [48] A Comparative Study of Divisive and Agglomerative Hierarchical Clustering Algorithms
    Roux, Maurice
    Journal of Classification, 2018, 35 : 345 - 366
  • [49] The impact of isolation kernel on agglomerative hierarchical clustering algorithms
    Han, Xin
    Zhu, Ye
    Ting, Kai Ming
    Li, Gang
    PATTERN RECOGNITION, 2023, 139
  • [50] Agglomerative hierarchical clustering technique for partitioning patent dataset
    Smarika
    Mattas, Nisha
    Kalra, Parul
    Mehrotra, Deepti
    2015 4TH INTERNATIONAL CONFERENCE ON RELIABILITY, INFOCOM TECHNOLOGIES AND OPTIMIZATION (ICRITO) (TRENDS AND FUTURE DIRECTIONS), 2015,