On clustering massive text and categorical data streams

Cited by: 55
Authors
Aggarwal, Charu C. [1]
Yu, Philip S. [2]
Affiliations
[1] IBM TJ Watson Res Ctr, Hawthorne, NY 10532 USA
[2] Univ Illinois, Chicago, IL USA
Keywords
Stream clustering; Text clustering; Text streams; Text stream clustering; Categorical data;
DOI
10.1007/s10115-009-0241-z
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we study the data stream clustering problem in the context of text and categorical data domains. While the clustering problem has been studied recently for numeric data streams, text and categorical data present different challenges because of the large and unordered nature of the corresponding attributes. Therefore, we propose algorithms for text and categorical data stream clustering. We propose a condensation-based approach for stream clustering which summarizes the stream into a number of fine-grained cluster droplets. These summarized droplets can be used in conjunction with a variety of user queries to construct the clusters for different input parameters. Thus, this provides an online analytical processing approach to stream clustering. We also study the problem of detecting noisy and outlier records in real time. We test the approach on a number of real and synthetic data sets, and show the effectiveness of the method over the baseline OSKM algorithm for stream clustering.
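The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of condensing a text stream into small additive summaries. It is not the authors' algorithm. Every name in it (`Droplet`, `cluster_stream`, `max_droplets`, `threshold`) is hypothetical, and the cosine-similarity assignment rule and the replacement of the least recently updated droplet are assumptions made for illustration.

```python
import math
from collections import Counter


class Droplet:
    """Hypothetical additive summary of the documents absorbed so far."""

    def __init__(self, words, now):
        self.counts = Counter(words)  # summed word frequencies
        self.n = 1                    # number of documents absorbed
        self.last_update = now        # time of the most recent absorption

    def absorb(self, words, now):
        # Summaries are additive, so absorbing a document is a count update.
        self.counts.update(words)
        self.n += 1
        self.last_update = now


def cosine(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(v * b.get(w, 0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_stream(docs, max_droplets=3, threshold=0.3):
    """One pass over the stream; returns the droplets and flagged positions.

    Assumed rule: a document joins its most similar droplet when the
    similarity reaches `threshold`. Otherwise it is flagged as a possible
    outlier and starts a new droplet, evicting the stalest one if full.
    """
    droplets, flagged = [], []
    for now, words in enumerate(docs):
        vec = Counter(words)
        best = max(droplets, key=lambda d: cosine(vec, d.counts), default=None)
        if best is not None and cosine(vec, best.counts) >= threshold:
            best.absorb(words, now)
            continue
        flagged.append(now)
        if len(droplets) >= max_droplets:
            droplets.remove(min(droplets, key=lambda d: d.last_update))
        droplets.append(Droplet(words, now))
    return droplets, flagged
```

Because each summary is a sum of counts, droplets could later be merged into coarser clusters for a given query without revisiting the stream, which is the kind of offline use of summaries the abstract alludes to. The merging step is not shown here.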
Pages: 171 - 196
Page count: 26
Related papers
50 in total
  • [1] On clustering massive text and categorical data streams
    Charu C. Aggarwal
    Philip S. Yu
    [J]. Knowledge and Information Systems, 2010, 24 : 171 - 196
  • [2] A Framework for Clustering Massive Text and Categorical Data Streams
    Aggarwal, Charu C.
    Yu, Philip S.
    [J]. PROCEEDINGS OF THE SIXTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2006, : 479 - 483
  • [3] Clustering categorical data streams
    He, Zengyou
    Xu, Xiaofei
    Deng, Shengchun
    Huang, Joshua Zhexue
    [J]. JOURNAL OF COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING, 2011, 11 (04) : 185 - 192
  • [4] Clustering massive text data streams by semantic smoothing model
    Liu, Yubao
    Cai, Jiarong
    Yin, Jian
    Wai-Chee Fu, Ada
    [J]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2007, 4632 : 389 - 400
  • [5] Clustering massive text data streams by semantic smoothing model
    Liu, Yubao
    Cai, Jiarong
    Yin, Jian
    Fu, Ada Wai-Chee
    [J]. ADVANCED DATA MINING AND APPLICATIONS, PROCEEDINGS, 2007, 4632 : 389 - +
  • [6] Clustering Text Data Streams
    刘玉葆
    蔡嘉荣
    印鉴
    傅蔚慈
    [J]. Journal of Computer Science & Technology, 2008, 23 (01) : 112 - 128
  • [7] Clustering text data streams
    Liu, Yu-Bao
    Cai, Jia-Rong
    Yin, Jian
    Fu, Ada Wai-Chee
    [J]. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2008, 23 (01) : 112 - 128
  • [8] Clustering Text Data Streams
    Yu-Bao Liu
    Jia-Rong Cai
    Jian Yin
    Ada Wai-Chee Fu
    [J]. Journal of Computer Science and Technology, 2008, 23 : 112 - 128
  • [9] Detecting the Change of Clustering Structure in Categorical Data Streams
    Chen, Keke
    Liu, Ling
    [J]. PROCEEDINGS OF THE SIXTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2006, : 504 - 508
  • [10] SCLOPE: An algorithm for clustering data streams of categorical attributes
    Ong, KL
    Li, WY
    Ng, WK
    Lim, EP
    [J]. DATA WAREHOUSING AND KNOWLEDGE DISCOVERY, PROCEEDINGS, 2004, 3181 : 209 - 218