Efficient Hierarchical Clustering of Large High Dimensional Datasets

Cited by: 14
Authors
Gilpin, Sean [1]
Qian, Buyue [2]
Davidson, Ian [1]
Affiliations
[1] Univ Calif Davis, Davis, CA 95616 USA
[2] IBM TJ Watson, Yorktown Hts, NY USA
Keywords
Binary Codes; Hierarchical Clustering; ALGORITHM;
DOI
10.1145/2505515.2505527
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Hierarchical clustering is extensively used to organize high dimensional objects such as documents and images into a structure which can then be used in a multitude of ways. However, existing algorithms are limited in their application, since the time complexity of agglomerative style algorithms can be as much as O(n^2 log n), where n is the number of objects. Furthermore, computing the similarity between such objects is itself time consuming because they are high dimensional, and even optimized built-in functions found in MATLAB take the best part of a day to handle collections of just 10,000 objects on typical machines. In this paper we explore using angular hashing to hash objects with similar angular distance to the same hash bucket. This allows us to create hierarchies of objects within each hash bucket and to hierarchically cluster the hash buckets themselves. With our formal guarantees on the similarity of objects in the same bucket, this leads to an elegant agglomerative algorithm with strong performance bounds. Our experimental results show that not only is our approach thousands of times faster than regular agglomerative algorithms but, surprisingly, the accuracy of our results is typically as good and can sometimes be substantially better.
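The bucketing step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which hash family the paper uses, so this example assumes random-hyperplane (sign) hashing, a common angular hashing scheme; the function names `angular_hash` and `bucketize` and the parameter `n_bits` are hypothetical and are not taken from the paper.

```python
import numpy as np
from collections import defaultdict


def angular_hash(X, n_bits=8, seed=0):
    """Assumed scheme: one bit per random hyperplane, set when the
    projection of a row onto the hyperplane normal is non-negative."""
    rng = np.random.default_rng(seed)
    # One Gaussian normal vector per bit, shape (n_features, n_bits).
    planes = rng.standard_normal((X.shape[1], n_bits))
    return (X @ planes >= 0).astype(np.uint8)


def bucketize(codes):
    """Group row indices that share an identical binary code."""
    buckets = defaultdict(list)
    for i, code in enumerate(codes):
        buckets[code.tobytes()].append(i)
    return dict(buckets)


# Toy data: rows 0 and 1 point the same way, row 2 points the opposite way.
X = np.array([[1.0, 0.0, 0.0],
              [2.0, 0.0, 0.0],
              [-1.0, 0.0, 0.0]])
codes = angular_hash(X, n_bits=8)
buckets = bucketize(codes)
```

Vectors separated by a small angle agree on most bits, so they tend to share a bucket, while the code ignores vector length. Under this assumption, an agglomerative method would then be run inside each bucket and again over the buckets themselves, as the abstract outlines.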
Pages: 1371-1380
Page count: 10
Related papers
50 in total
  • [1] NBC: An Efficient Hierarchical Clustering Algorithm for Large Datasets
    Zhang, Wei
    Zhang, Gongxuan
    Wang, Yongli
    Zhu, Zhaomeng
    Li, Tao
    [J]. INTERNATIONAL JOURNAL OF SEMANTIC COMPUTING, 2015, 9 (03) : 307 - 331
  • [2] AGRID: An efficient algorithm for clustering large high-dimensional datasets
    Zhao, YC
    Song, JD
    [J]. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, 2003, 2637 : 271 - 282
  • [3] An Efficient Hierarchical Clustering Method for Large Datasets with Map-Reduce
    Sun, Tianyang
    Shu, Chengchun
    Li, Feng
    Yu, Haiyan
    Ma, Lili
    Fang, Yitong
    [J]. 2009 INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED COMPUTING, APPLICATIONS AND TECHNOLOGIES (PDCAT 2009), 2009, : 494 - +
  • [4] An Efficient Density Biased Sampling Algorithm for Clustering Large High-Dimensional Datasets
    Qian, Xue-Zhong
    Deng, Jie
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2015, 29 (08)
  • [5] Hierarchical clustering algorithms for large datasets
    Stekh, Yuri
    Kernytskyy, Andriy
    Lobur, Mykhaylo
    [J]. TCSET 2006: MODERN PROBLEMS OF RADIO ENGINEERING, TELECOMMUNICATIONS AND COMPUTER SCIENCE, PROCEEDINGS, 2006, : 388 - 390
  • [6] NNB: An Efficient Nearest Neighbor Search Method for Hierarchical Clustering on Large Datasets
    Zhang, Wei
    Zhang, Gongxuan
    Wang, Yongli
    Zhu, Zhaomeng
    Li, Tao
    [J]. 2015 IEEE 9TH INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING (ICSC), 2015, : 405 - 412
  • [7] A clustering scheme for large high-dimensional document datasets
    Jiang, Jung-Yi
    Chen, Jing-Wen
    Lee, Shie-Jue
    [J]. ADVANCES IN COMPUTATION AND INTELLIGENCE, PROCEEDINGS, 2007, 4683 : 511 - 519
  • [8] Systematic Review of Clustering High-Dimensional and Large Datasets
    Pandove, Divya
    Goel, Shivani
    Rani, Rinkle
    [J]. ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2018, 12 (02)
  • [9] Hierarchical model-based clustering for large datasets
    Posse, C
    [J]. JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2001, 10 (03) : 464 - 486
  • [10] Effective data summarization for hierarchical clustering in large datasets
    Patra, Bidyut Kr.
    Nandi, Sukumar
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2015, 42 (01) : 1 - 20