Clustering in very large databases based on distance and density

Cited by: 29

Authors:

Qian, WN ^{[1]}

Gong, XQ ^{[1]}

Zhou, AY ^{[1]}

Affiliations:

[1] Fudan Univ, Dept Comp Sci & Engn, Lab Intelligent Informat Proc, Shanghai 200433, Peoples R China

Source:

JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY | 2003 / Vol. 18 / Issue 01

Keywords:

data mining; very large database; clustering

DOI:

10.1007/BF02946652

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Clustering in very large databases or data warehouses, with many applications in areas such as spatial computation, web information collection, pattern recognition and economic analysis, is a huge task that challenges data mining researchers. Current clustering methods commonly suffer from the following problems: 1) scanning the whole database leads to high I/O cost and expensive maintenance (e.g., R*-tree); 2) the uncertain parameter k must be specified in advance, so the clustering can only be refined through repeated trial and error; 3) arbitrarily shaped clusters are handled inefficiently when the data set is very large. In this paper, we first present a new hybrid clustering algorithm to solve these problems. This new algorithm, which combines both distance and density strategies, can handle clusters of arbitrary shape effectively. It makes full use of statistical information gathered during mining to greatly reduce the time complexity while keeping good clustering quality. Furthermore, this algorithm can easily eliminate noise and identify outliers. An experimental evaluation is performed on a spatial database with this method and other popular clustering algorithms (CURE and DBSCAN). The results show that our algorithm outperforms them in terms of efficiency and cost, and achieves an even greater speedup as the data size scales up.

Citation

Pages: 67 - 76

Page count: 10

50 items in total

[1] Clustering in very large databases based on distance and density
Weining Qian
XueQing Gong
AoYing Zhou
[J]. Journal of Computer Science and Technology, 2003, 18 : 67 - 76
[2] An efficient density based clustering algorithm for large databases
El-Sonbaty, Y
Ismail, MA
Farouk, M
[J]. ICTAI 2004: 16TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2004, : 673 - 677
[3] Clustering and validation for very large databases (VLDB)
Momin, Bashirahamad Fardin
[J]. 2006 INTERNATIONAL CONFERENCE ON INFORMATION AND AUTOMATION, 2007, : 258 - 263
[4] A fast density-based clustering algorithm for large databases
Liu, Bing
[J]. PROCEEDINGS OF 2006 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2006, : 996 - 1000
[5] Short documents clustering in very large text databases
Wang, Yongheng
Jia, Yan
Yang, Shuqiang
[J]. WEB INFORMATION SYSTEMS - WISE 2006 WORKSHOPS, PROCEEDINGS, 2006, 4256 : 83 - 93
[6] Hybridized Fragmentation of Very Large Databases Using Clustering
Harikumar, Sandhya
Ramachandran, Raji
[J]. 2015 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, INFORMATICS, COMMUNICATION AND ENERGY SYSTEMS (SPICES), 2015,
[7] Scalable grid-based clustering algorithm for very large spatial databases
Sun, Yufen
Lu, Yansheng
[J]. 2006 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY, PTS 1 AND 2, PROCEEDINGS, 2006, : 763 - 768
[8] WINP: A window-based incremental and parallel clustering algorithm for very large databases
Qiang, Z
Zheng, Z
Wei, SZ
Daley, E
[J]. ICTAI 2005: 17TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2005, : 169 - 176
[9] WaveCluster: a wavelet-based clustering approach for spatial data in very large databases
Sheikholeslami, G
Chatterjee, S
Zhang, AD
[J]. VLDB JOURNAL, 2000, 8 (3-4) : 289 - 304
[10] WaveCluster: a wavelet-based clustering approach for spatial data in very large databases
Gholamhosein Sheikholeslami
Surojit Chatterjee
Aidong Zhang
[J]. The VLDB Journal, 2000, 8 : 289 - 304
