A Framework for Clustering Massive-Domain Data Streams

Cited by: 11
Authors
Aggarwal, Charu C. [1]
Affiliations
[1] IBM TJ Watson Res Ctr, Hawthorne, NY 10532 USA
DOI
10.1109/ICDE.2009.13
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
In this paper, we examine the problem of clustering massive-domain data streams. Massive-domain data streams are those in which the number of possible domain values for each attribute is very large and cannot be easily tracked for clustering purposes. Examples of such streams include IP-address streams, credit-card transaction streams, and streams of sales data over large numbers of items. In such cases, it is well known that even simple stream operations such as counting can be extremely difficult, because summary information is hard to maintain over the different discrete values. The task of clustering is significantly more challenging in such cases, since the intermediate statistics for the different clusters cannot be maintained efficiently. We propose a method for clustering massive-domain data streams with the use of sketches. We prove probabilistic results which show that a sketch-based clustering method can provide results similar to those of an infinite-space clustering algorithm with high probability. We present experimental results which validate these theoretical results and show that it is possible to approximate the behavior of an infinite-space algorithm accurately.
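The abstract does not say which sketch structure or similarity function is used, so the following is only a minimal illustration of the general idea, assuming a count-min sketch per cluster and a simple frequency-based score. The names `CountMinSketch` and `assign`, the default `width` and `depth`, and the scoring rule are all hypothetical and are not taken from the paper.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter; estimates never fall below true counts."""

    def __init__(self, width=1024, depth=4):
        # width and depth are illustrative defaults, not values from the paper.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, value):
        # One salted hash per row stands in for independent hash functions.
        digest = hashlib.blake2b(
            value.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, value, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, value)] += count

    def estimate(self, value):
        # The minimum over rows limits the overcount caused by collisions.
        return min(
            self.table[row][self._index(row, value)]
            for row in range(self.depth)
        )


def assign(record, sketches, sizes):
    """Assign a record to the cluster with the highest average estimated
    frequency of its (attribute index, value) pairs, then update that
    cluster's sketch. The scoring rule is an assumption for illustration."""

    def score(k):
        if sizes[k] == 0:
            return 0.0
        total = sum(
            sketches[k].estimate(f"{i}:{v}") for i, v in enumerate(record)
        )
        return total / sizes[k]

    best = max(range(len(sketches)), key=score)
    for i, v in enumerate(record):
        sketches[best].add(f"{i}:{v}")
    sizes[best] += 1
    return best
```

Each cluster keeps only `width * depth` counters regardless of how many distinct attribute values appear in the stream, which is the space saving the abstract contrasts with an infinite-space algorithm.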
Pages: 102 - 113
Number of pages: 12
Related papers
50 in total
  • [1] A Framework for Clustering Massive Text and Categorical Data Streams
    Aggarwal, Charu C.
    Yu, Philip S.
    [J]. PROCEEDINGS OF THE SIXTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2006, : 479 - 483
  • [2] A framework for clustering massive graph streams
    Aggarwal, Charu C.
    Zhao, Yuchen
    Yu, Philip S.
    [J]. Statistical Analysis and Data Mining, 2010, 3 (06): : 399 - 416
  • [3] On clustering massive text and categorical data streams
    Charu C. Aggarwal
    Philip S. Yu
    [J]. Knowledge and Information Systems, 2010, 24 : 171 - 196
  • [4] On clustering massive text and categorical data streams
    Aggarwal, Charu C.
    Yu, Philip S.
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2010, 24 (02) : 171 - 196
  • [5] An Adaptive Framework for Clustering Data Streams
    Chandrika
    Kumar, K. R. Ananda
    [J]. ADVANCES IN COMPUTING AND COMMUNICATIONS, PT I, 2011, 190 : 704 - +
  • [6] A general framework for mining massive data streams
    Domingos, P
    Hulten, G
    [J]. JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2003, 12 (04) : 945 - 949
  • [7] Scaling clustering algorithms for massive data sets using data streams
    Nittel, S
    Leung, KT
    Braverman, A
    [J]. 20TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2004, : 830 - 830
  • [8] A Framework for Classification and Segmentation of Massive Audio Data Streams
    Aggarwal, Charu C.
    [J]. KDD-2007 PROCEEDINGS OF THE THIRTEENTH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2007, : 1013 - 1017
  • [9] Clustering massive text data streams by semantic smoothing model
    Liu, Yubao
    Cai, Jiarong
    Yin, Jian
    Fu, Ada Wai-Chee
    [J]. ADVANCED DATA MINING AND APPLICATIONS, PROCEEDINGS, 2007, 4632 : 389 - +
  • [10] A Clustering-based Framework for Classifying Data Streams
    Yan, Xuyang
    Homaifar, Abdollah
    Sarkar, Mrinmoy
    Girma, Abenezer
    Tunstel, Edward
    [J]. PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 3257 - 3263