Efficient Hierarchical Clustering of Large High Dimensional Datasets

Cited by: 14

Authors:

Gilpin, Sean [1]

Qian, Buyue [2]

Davidson, Ian [1]

Affiliations:

[1] Univ Calif Davis, Davis, CA 95616 USA

[2] IBM TJ Watson, Yorktown Hts, NY USA

Source:

PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13) | 2013

Keywords:

Binary Codes; Hierarchical Clustering; ALGORITHM;

DOI:

10.1145/2505515.2505527

Chinese Library Classification:

TP18 [Artificial Intelligence Theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Hierarchical clustering is widely used to organize high dimensional objects such as documents and images into a structure that can then be used in a multitude of ways. However, existing algorithms are limited in their application because the time complexity of agglomerative-style algorithms can be as much as O(n^2 log n), where n is the number of objects. Furthermore, computing the similarity between such objects is itself time consuming because they are high dimensional, and even the optimized built-in functions found in MATLAB take the best part of a day to handle collections of just 10,000 objects on typical machines. In this paper we explore using angular hashing to hash objects with similar angular distance to the same hash bucket. This allows us to create hierarchies of objects within each hash bucket and to hierarchically cluster the hash buckets themselves. With our formal guarantees on the similarity of objects in the same bucket, this leads to an elegant agglomerative algorithm with strong performance bounds. Our experimental results show that not only is our approach thousands of times faster than regular agglomerative algorithms but, surprisingly, the accuracy of our results is typically as good and can sometimes be substantially better.
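The bucketing step mentioned in the abstract can be illustrated with a minimal sketch. It assumes sign random projections (random hyperplane hashing), a common angular hashing scheme; the abstract does not say which scheme the paper uses, so this is not the authors' method. The function names `make_hyperplanes`, `angular_hash`, and `bucket_by_hash` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def angular_hash(vector, hyperplanes):
    """Return a binary code: one bit per hyperplane, set by the sign of the dot product."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def bucket_by_hash(vectors, hyperplanes):
    """Group vector indices by binary code; vectors at a small angle tend to collide."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[angular_hash(vec, hyperplanes)].append(idx)
    return dict(buckets)


# Illustrative data (assumed, not from the paper).
planes = make_hyperplanes(num_bits=8, dim=3, seed=42)
vectors = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],     # same direction as the first vector, different length
    [-1.0, -2.0, -3.0],  # opposite direction
]
buckets = bucket_by_hash(vectors, planes)
```

Because each bit depends only on the sign of a dot product, the code ignores vector length: the first two vectors always share a bucket, while the opposite vector lands elsewhere. A hierarchy could then be built inside each bucket and over the buckets themselves, as the abstract describes, but that part is not sketched here.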

Pages: 1371-1380

Page count: 10
