Clustering in very large databases based on distance and density

Cited by: 29
Authors
Qian, WN [1]
Gong, XQ [1]
Zhou, AY [1]
Affiliation
[1] Fudan Univ, Dept Comp Sci & Engn, Lab Intelligent Informat Proc, Shanghai 200433, Peoples R China
Source
Journal of Computer Science and Technology, 2003, 18
Keywords
data mining; very large database; clustering
DOI
10.1007/BF02946652
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Clustering in very large databases or data warehouses, which has many applications in areas such as spatial computation, web information collection, pattern recognition and economic analysis, is a major challenge for data mining research. Current clustering methods commonly suffer from the following problems: 1) scanning the whole database leads to high I/O cost and expensive maintenance (e.g., R*-tree); 2) the uncertain parameter k must be specified in advance, so the clustering can only be refined through repeated trial and error; 3) they handle clusters of arbitrary shape inefficiently on very large data sets. In this paper, we first present a new hybrid clustering algorithm to solve these problems. This new algorithm, which combines both distance and density strategies, can handle clusters of arbitrary shape effectively. It makes full use of statistical information gathered during mining to greatly reduce the time complexity while keeping good clustering quality. Furthermore, the algorithm can easily eliminate noise and identify outliers. An experimental evaluation is performed on a spatial database with this method and other popular clustering algorithms (CURE and DBSCAN). The results show that our algorithm outperforms them in terms of efficiency and cost, and that its speedup grows as the data size increases.
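The abstract does not spell out the algorithm's steps, so the following is only a generic illustration of how distance and density strategies can be combined, not the authors' method. It assumes a single pass that groups points into subclusters by distance while keeping summary statistics (count and coordinate sum), a density filter that treats sparse subclusters as noise, and a merge of nearby dense subclusters so that chains of them can trace arbitrary shapes. The function name `hybrid_cluster` and the parameters `radius`, `min_density` and `merge_dist` are hypothetical.

```python
import math
from collections import defaultdict


def hybrid_cluster(points, radius, min_density, merge_dist):
    """Illustrative distance + density clustering (not the paper's algorithm)."""
    # Distance step: one pass over the data. Each point joins the nearest
    # subcluster whose centroid lies within `radius`, else starts a new one.
    # Only the count and coordinate sum are needed to recompute a centroid.
    subs = []
    for p in points:
        best, best_d = None, radius
        for s in subs:
            centroid = [v / s["n"] for v in s["sum"]]
            d = math.dist(p, centroid)
            if d <= best_d:
                best, best_d = s, d
        if best is None:
            subs.append({"n": 1, "sum": list(p), "members": [p]})
        else:
            best["n"] += 1
            best["sum"] = [a + b for a, b in zip(best["sum"], p)]
            best["members"].append(p)

    # Density step: subclusters with fewer than `min_density` points are noise.
    dense = [s for s in subs if s["n"] >= min_density]
    noise = [p for s in subs if s["n"] < min_density for p in s["members"]]

    # Merge step: union-find over dense subclusters whose centroids are close.
    parent = list(range(len(dense)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    cents = [[v / s["n"] for v in s["sum"]] for s in dense]
    for i in range(len(dense)):
        for j in range(i + 1, len(dense)):
            if math.dist(cents[i], cents[j]) <= merge_dist:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, s in enumerate(dense):
        groups[find(i)].extend(s["members"])
    return list(groups.values()), noise
```

In this sketch the number of clusters is not fixed in advance: it falls out of the three thresholds, which would themselves need tuning on real data.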
Pages: 67-76
Number of pages: 10
Related papers
50 in total
  • [1] Clustering in very large databases based on distance and density
    Weining Qian
    XueQing Gong
    AoYing Zhou
    [J]. Journal of Computer Science and Technology, 2003, 18 : 67 - 76
  • [2] An efficient density based clustering algorithm for large databases
    El-Sonbaty, Y
    Ismail, MA
    Farouk, M
    [J]. ICTAI 2004: 16TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2004, : 673 - 677
  • [3] Clustering and validation for very large databases (VLDB)
    Momin, Bashirahamad Fardin
    [J]. 2006 INTERNATIONAL CONFERENCE ON INFORMATION AND AUTOMATION, 2007, : 258 - 263
  • [4] A fast density-based clustering algorithm for large databases
    Liu, Bing
    [J]. PROCEEDINGS OF 2006 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2006, : 996 - 1000
  • [5] Short documents clustering in very large text databases
    Wang, Yongheng
    Jia, Yan
    Yang, Shuqiang
    [J]. WEB INFORMATION SYSTEMS - WISE 2006 WORKSHOPS, PROCEEDINGS, 2006, 4256 : 83 - 93
  • [6] Hybridized Fragmentation of Very Large Databases Using Clustering
    Harikumar, Sandhya
    Ramachandran, Raji
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, INFORMATICS, COMMUNICATION AND ENERGY SYSTEMS (SPICES), 2015,
  • [7] Scalable grid-based clustering algorithm for very large spatial databases
    Sun, Yufen
    Lu, Yansheng
    [J]. 2006 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY, PTS 1 AND 2, PROCEEDINGS, 2006, : 763 - 768
  • [8] WINP: A window-based incremental and parallel clustering algorithm for very large databases
    Qiang, Z
    Zheng, Z
    Wei, SZ
    Daley, E
    [J]. ICTAI 2005: 17TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2005, : 169 - 176
  • [9] WaveCluster: a wavelet-based clustering approach for spatial data in very large databases
    Sheikholeslami, G
    Chatterjee, S
    Zhang, AD
    [J]. VLDB JOURNAL, 2000, 8 (3-4): 289 - 304
  • [10] WaveCluster: a wavelet-based clustering approach for spatial data in very large databases
    Gholamhosein Sheikholeslami
    Surojit Chatterjee
    Aidong Zhang
    [J]. The VLDB Journal, 2000, 8 : 289 - 304