Summarizing and Mining Skewed Data Streams

Cited by: 0
Authors
Cormode, Graham [1 ]
Muthukrishnan, S. [1 ]
Affiliations
[1] Rutgers State Univ, Ctr Discrete Math & Comp Sci DIMACS, Piscataway, NJ USA
Keywords
data stream analysis; data mining; Zipf distribution; power laws; heavy hitters; massive data;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Many applications generate massive data streams. Summarizing such massive data requires fast, small-space algorithms to support post-hoc queries and mining. An important observation is that such streams are rarely uniform, and real data sources typically exhibit significant skewness. These are well modeled by Zipf distributions, which are characterized by a parameter, $z$, that captures the amount of skew. We present a data stream summary that can answer point queries with $\epsilon$ accuracy and show that the space needed is only $O(\epsilon^{-\min\{1,1/z\}})$. This is the first $o(1/\epsilon)$ space algorithm for this problem, and we show it is essentially tight for skewed distributions. We show that the same data structure can also estimate the $L_2$ norm of the stream in $o(1/\epsilon^2)$ space for $z > 1/2$, another improvement over the existing $\Omega(1/\epsilon^2)$ methods. We support our theoretical results with an experimental study over a large variety of real and synthetic data. We show that significant skew is present in both textual and telecommunication data. Our methods give strong accuracy, significantly better than other methods, and behave exactly in line with their analytic bounds.
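The abstract does not name the summary it analyzes, so the following is only an illustrative sketch of what a small-space summary answering point queries can look like. It assumes a Count-Min-style array of hashed counters; that choice, and the names `CountMinSketch`, `zipf_counts`, `width`, `depth`, and `seed`, are assumptions made for illustration and are not taken from this record. The synthetic Zipf generator stands in for the skewed data the abstract describes.

```python
import random


class CountMinSketch:
    """Illustrative Count-Min-style summary (an assumption, not the paper's stated method)."""

    def __init__(self, width, depth, seed=0):
        self.width = width          # counters per row
        self.depth = depth          # number of independent rows
        self.prime = (1 << 61) - 1  # large prime modulus for the hash family
        rng = random.Random(seed)
        # One (a, b) pair per row for h(x) = ((a*x + b) mod prime) mod width.
        self.params = [
            (rng.randrange(1, self.prime), rng.randrange(self.prime))
            for _ in range(depth)
        ]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        a, b = self.params[row]
        return ((a * item + b) % self.prime) % self.width

    def update(self, item, count=1):
        # Add the count to one counter in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def query(self, item):
        # With non-negative updates each counter overestimates the true
        # count, so the minimum over rows is the tightest estimate.
        return min(
            self.table[row][self._bucket(row, item)]
            for row in range(self.depth)
        )


def zipf_counts(n_items, z, total):
    """Synthetic item counts proportional to 1 / rank**z (integer item ids)."""
    weights = [1.0 / (rank ** z) for rank in range(1, n_items + 1)]
    norm = sum(weights)
    return {rank: int(total * w / norm) for rank, w in enumerate(weights, start=1)}


if __name__ == "__main__":
    counts = zipf_counts(n_items=1000, z=1.5, total=100_000)
    sketch = CountMinSketch(width=64, depth=4, seed=1)
    for item, c in counts.items():
        sketch.update(item, c)
    for item in (1, 2, 10, 500):
        print(item, counts[item], sketch.query(item))
```

With larger $z$, a few heavy items carry most of the stream's weight and the light items that collide with any given counter contribute little, which is the intuition behind needing fewer counters on skewed data. The sketch above does not reproduce the paper's analysis or its bounds.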
Pages: 44-55
Page count: 12