Clustering high dimensional massive scientific datasets

Cited by: 2

Authors:

Otoo, EJ

Shoshani, A

Hwang, SW

Affiliations:

[1] Univ Calif Berkeley, Lawrence Berkeley Lab, Berkeley, CA 94720 USA

[2] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA

Source:

JOURNAL OF INTELLIGENT INFORMATION SYSTEMS | 2001 / Vol. 17 / Issue 2-3

Keywords:

clustering; high dimensional; scientific; datasets;

DOI:

10.1023/A:1012853629322

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Many scientific applications can benefit from an efficient algorithm for clustering massively large, high-dimensional datasets. However, most existing algorithms are impractical when the amount of data is very large. Given N objects, each defined by an M-dimensional feature vector, any clustering technique for very large datasets in high-dimensional space should run in O(MN) time at best, and O(MN log N) time in the worst case, using no more than O(MN) storage, for it to be practical. We introduce a hybrid algorithm, called HyCeltyc, for clustering massively large high-dimensional datasets in O(MN) time, which is linear in the size of the data. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density-based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm involve sampling, dimensionality reduction, selection of significant features on which to cluster the data, and a grid-based clustering algorithm that is linear in the data size.
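The abstract names a grid-based, cell-density clustering step but gives no details. The following is a minimal generic sketch of that idea, not the authors' HyCeltyc implementation: it omits sampling, dimensionality reduction, feature selection, and the hierarchical agglomerative stage. The function name and the parameters `cell_size` and `min_count` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict, deque


def grid_density_cluster(points, cell_size, min_count):
    """Label each point with a cluster id, or -1 if its grid cell is sparse.

    Illustrative only: cell_size and min_count are assumed parameters.
    """
    # Pass 1: hash every point into its grid cell, O(M) work per point.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(int(x // cell_size) for x in p)
        cells[key].append(idx)

    # A cell is "dense" if it holds at least min_count points.
    dense = {k for k, members in cells.items() if len(members) >= min_count}

    # Pass 2: flood-fill over face-adjacent dense cells
    # (2*M neighbours per cell, rather than all 3**M - 1).
    cell_label = {}
    next_label = 0
    for start in dense:
        if start in cell_label:
            continue
        cell_label[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for dim in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if nb in dense and nb not in cell_label:
                        cell_label[nb] = next_label
                        queue.append(nb)
        next_label += 1

    # Points in dense cells inherit the cell's cluster id; the rest stay -1.
    labels = [-1] * len(points)
    for cell, lab in cell_label.items():
        for idx in cells[cell]:
            labels[idx] = lab
    return labels
```

The first pass touches each of the N points once and does O(M) work per point, which is where the O(MN) behaviour of grid-based methods comes from. Restricting the merge to face-adjacent cells is a design choice made here to avoid enumerating an exponential number of neighbours in high dimensions.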


Pages: 147 - 168

Page count: 22

