Clustering high dimensional massive scientific datasets

Cited by: 2
Authors
Otoo, EJ
Shoshani, A
Hwang, SW
Affiliations
[1] Univ Calif Berkeley, Lawrence Berkeley Lab, Berkeley, CA 94720 USA
[2] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
Keywords
clustering; high dimensional; scientific; datasets;
DOI
10.1023/A:1012853629322
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many scientific applications can benefit from an efficient clustering algorithm for massively large, high-dimensional datasets. However, most existing algorithms are impractical when the amount of data is very large. Given N objects, each defined by an M-dimensional feature vector, any clustering technique for very large datasets in high-dimensional space should run in O(MN) time at best and O(MN log N) time in the worst case, using no more than O(MN) storage, for it to be practical. We introduce a hybrid algorithm, called HyCeltyc, for clustering massively large high-dimensional datasets in O(MN) time, which is linear in the size of the data. HyCeltyc, which stands for Hybrid Cell Density Clustering method, combines a cell-density-based algorithm with a hierarchical agglomerative method to identify clusters in linear time. The main steps of the algorithm are sampling, dimensionality reduction, selection of significant features on which to cluster the data, and a grid-based clustering algorithm that is linear in the data size.
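The grid-based step mentioned in the abstract can be illustrated with a small sketch. This is not the HyCeltyc implementation, which the abstract does not specify. It is a generic cell-density scheme under stated assumptions: points are hashed into axis-aligned cells of a fixed width, cells holding at least a threshold number of points count as dense, and face-adjacent dense cells are merged with union-find. The function name `grid_density_clusters` and the parameters `cell_width` and `min_count` are hypothetical, and the sampling, dimensionality reduction, feature selection, and hierarchical agglomerative stages are omitted.

```python
from collections import defaultdict


def grid_density_clusters(points, cell_width, min_count):
    """Label points by connected groups of dense grid cells (illustrative only).

    Returns one label per point; -1 marks points in sparse cells.
    """
    # Pass 1: hash every point into its grid cell.
    # Each point costs O(M), so this pass is O(MN) in expectation.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(int(x // cell_width) for x in p)
        cells[key].append(idx)

    # A cell is "dense" when it holds at least min_count points (assumed rule).
    dense = {key for key, members in cells.items() if len(members) >= min_count}

    # Union-find over dense cells, with path halving.
    parent = {key: key for key in dense}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    # Merge dense cells that share a face. Probing only the +1 neighbour on
    # each axis visits every adjacent pair exactly once. Building each
    # neighbour key costs O(M), so this simple form is O(M^2) per dense cell.
    for key in dense:
        for axis in range(len(key)):
            neighbour = key[:axis] + (key[axis] + 1,) + key[axis + 1:]
            if neighbour in dense:
                parent[find(key)] = find(neighbour)

    # Pass 2: number the clusters in first-seen order and label the points.
    labels = [-1] * len(points)
    cluster_ids = {}
    for key, members in cells.items():
        if key in dense:
            cid = cluster_ids.setdefault(find(key), len(cluster_ids))
            for idx in members:
                labels[idx] = cid
    return labels


if __name__ == "__main__":
    sample = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.3, 0.4),
              (5.0, 5.0), (5.2, 5.1), (9.0, 0.0)]
    print(grid_density_clusters(sample, cell_width=1.0, min_count=2))
    # prints [0, 0, 0, 0, 1, 1, -1]
```

In the sample run, the two adjacent dense cells near the origin merge into one cluster, the cell around (5, 5) forms a second, and the isolated point at (9, 0) is left unlabelled. The number of occupied cells never exceeds N, which is why hashing only the occupied cells keeps storage within the O(MN) bound quoted in the abstract even when the full grid would be exponentially large in M.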
Pages: 147-168
Page count: 22
Related papers (first 10 of 50 listed)
  • [1] Clustering High Dimensional Massive Scientific Datasets
    Ekow J. Otoo
    Arie Shoshani
    Seung-Won Hwang
    [J]. Journal of Intelligent Information Systems, 2001, 17 : 147 - 168
  • [2] Clustering high dimensional massive scientific datasets
    Otoo, EJ
    Shoshani, A
    Hwang, S
    [J]. THIRTEENTH INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT, PROCEEDINGS, 2001, : 147 - 157
  • [3] Joining massive high-dimensional datasets
    Kahveci, T
    Lang, CA
    Singh, AK
    [J]. 19TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2003, : 265 - 276
  • [4] An Improved K-means Clustering Algorithm Applicable to Massive High-dimensional Matrix Datasets
    Li, Dong-Yuan
    Cao, Cai-Feng
    [J]. 2017 INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND TECHNOLOGY (IST 2017), 2017, 11
  • [5] Efficient Hierarchical Clustering of Large High Dimensional Datasets
    Gilpin, Sean
    Qian, Buyue
    Davidson, Ian
    [J]. PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), 2013, : 1371 - 1380
  • [6] Robust clustering of massive tractography datasets
    Guevara, P.
    Poupon, C.
    Riviere, D.
    Cointepas, Y.
    Descoteaux, M.
    Thirion, B.
    Mangin, J. -F.
    [J]. NEUROIMAGE, 2011, 54 (03) : 1975 - 1993
  • [7] A general framework for clustering high-dimensional datasets
    Zhao, YC
    Junde, S
    [J]. CCECE 2003: CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING, VOLS 1-3, PROCEEDINGS: TOWARD A CARING AND HUMANE TECHNOLOGY, 2003, : 1091 - 1094
  • [8] Parallel social spider clustering algorithm for high dimensional datasets
    Shukla, Urvashi Prakash
    Nanda, Satyasai Jagannath
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2016, 56 : 75 - 90
  • [9] A clustering scheme for large high-dimensional document datasets
    Jiang, Jung-Yi
    Chen, Jing-Wen
    Lee, Shie-Jue
    [J]. ADVANCES IN COMPUTATION AND INTELLIGENCE, PROCEEDINGS, 2007, 4683 : 511 - 519
  • [10] Systematic Review of Clustering High-Dimensional and Large Datasets
    Pandove, Divya
    Goel, Shivani
    Rani, Rinkle
    [J]. ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2018, 12 (02)