Summarizing and Mining Skewed Data Streams

Cited by: 0
Authors
Cormode, Graham [1]
Muthukrishnan, S. [1]
Affiliations
[1] Rutgers State Univ, Ctr Discrete Math & Comp Sci DIMACS, Piscataway, NJ USA
Keywords
data stream analysis; data mining; Zipf distribution; power laws; heavy hitters; massive data
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many applications generate massive data streams. Summarizing such massive data requires fast, small space algorithms to support post-hoc queries and mining. An important observation is that such streams are rarely uniform, and real data sources typically exhibit significant skewness. These are well modeled by Zipf distributions, which are characterized by a parameter, $z$, that captures the amount of skew. We present a data stream summary that can answer point queries with $\epsilon$ accuracy and show that the space needed is only $O(\epsilon^{-\min\{1,1/z\}})$. This is the first $o(1/\epsilon)$ space algorithm for this problem, and we show it is essentially tight for skewed distributions. We show that the same data structure can also estimate the $L_2$ norm of the stream in $o(1/\epsilon^2)$ space for $z > 1/2$, another improvement over the existing $\Omega(1/\epsilon^2)$ methods. We support our theoretical results with an experimental study over a large variety of real and synthetic data. We show that significant skew is present in both textual and telecommunication data. Our methods give strong accuracy, significantly better than other methods, and behave exactly in line with their analytic bounds.
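For readers who want a concrete picture of a point-query summary over a skewed stream, here is a minimal Python sketch. It is an illustration only: the abstract does not name the data structure, so the use of a Count-Min style sketch is an assumption, and the names `CountMinSketch` and `zipf_stream`, the hash construction, and all parameter values are hypothetical choices made for this example rather than the authors' implementation.

```python
import random
from collections import Counter


class CountMinSketch:
    """Illustrative Count-Min style summary (an assumed structure, not
    taken from the abstract): `depth` rows of `width` counters each."""

    def __init__(self, width, depth, seed=0):
        self.width = width
        self.depth = depth
        self.prime = (1 << 61) - 1  # large prime for the hash family
        rng = random.Random(seed)
        # One (a, b) pair per row defines the hash h(x) = ((a*x + b) mod p) mod width.
        self.params = [
            (rng.randrange(1, self.prime), rng.randrange(self.prime))
            for _ in range(depth)
        ]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        a, b = self.params[row]
        return ((a * item + b) % self.prime) % self.width

    def update(self, item, count=1):
        # Add the count to one counter in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def estimate(self, item):
        # Point query: the minimum over rows never underestimates
        # the true count when all updates are non-negative.
        return min(
            self.table[row][self._bucket(row, item)]
            for row in range(self.depth)
        )


def zipf_stream(n_items, length, z, seed=1):
    """Draw `length` integer items whose frequencies follow a Zipf law:
    the item of rank r (1-based) has weight proportional to 1 / r**z."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** z) for rank in range(1, n_items + 1)]
    return rng.choices(range(n_items), weights=weights, k=length)


if __name__ == "__main__":
    stream = zipf_stream(n_items=1000, length=20000, z=1.2)
    truth = Counter(stream)
    sketch = CountMinSketch(width=64, depth=4)
    for item in stream:
        sketch.update(item)
    # Compare true and estimated counts for the three most frequent items.
    for item, count in truth.most_common(3):
        print(item, count, sketch.estimate(item))
```

Varying `z` and `width` in this toy setup gives a rough feel for how skew affects the overestimate on a point query; it does not reproduce the space bounds stated in the abstract.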
Pages: 44-55
Page count: 12
Related papers
50 in total
  • [31] Mining data streams using clustering
    Lu, YH
    Huang, Y
    Proceedings of 2005 International Conference on Machine Learning and Cybernetics, Vols 1-9, 2005, : 2079 - 2083
  • [32] Mining Discriminative Itemsets in Data Streams
    Seyfi, Majid
    Geva, Shlomo
    Nayak, Richi
    WEB INFORMATION SYSTEMS ENGINEERING - WISE 2014, PT I, 2014, 8786 : 125 - 134
  • [33] Log summarizing agent for web access data using data mining techniques
    Kato, H
    Hiraishi, H
    Mizoguchi, F
    JOINT 9TH IFSA WORLD CONGRESS AND 20TH NAFIPS INTERNATIONAL CONFERENCE, PROCEEDINGS, VOLS. 1-5, 2001, : 2642 - 2647
  • [34] Exploring Data Mining Techniques in Medical Data Streams
    Sun, Le
    Ma, Jiangang
    Zhang, Yanchun
    Wang, Hua
    DATABASES THEORY AND APPLICATIONS, (ADC 2016), 2016, 9877 : 321 - 332
  • [35] Time-decaying bloom filters for data streams with skewed distributions
    Cheng, K
    Xiang, LM
    Iwaihara, M
    Xu, HY
    Mohania, MM
    15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications, Proceedings, 2005, : 63 - 69
  • [36] Accurate Quantile Estimation for Skewed Data Streams Using Nonlinear Interpolation
    Liu, Jun
    Zheng, Wenyao
    Lin, Zheng
    Lin, Nan
    IEEE ACCESS, 2018, 6 : 28438 - 28446
  • [37] Discussion on Fast and Accurate Sketches for Skewed Data Streams: A Case Study
    Sun, Shuhao
    Li, Dagang
    WEB AND BIG DATA (APWEB-WAIM 2018), PT II, 2018, 10988 : 75 - 89
  • [38] Mining skewed and sparse transaction data for personalized shopping recommendation
    Hsu, CN
    Chung, HH
    Huang, HS
    MACHINE LEARNING, 2004, 57 (1-2) : 35 - 59
  • [39] Mining Skewed and Sparse Transaction Data for Personalized Shopping Recommendation
    Chun-Nan Hsu
    Hao-Hsiang Chung
    Han-Shen Huang
    Machine Learning, 2004, 57 : 35 - 59
  • [40] Mining Data Streams with Dynamic Confidence Intervals
    Trabold, Daniel
    Horvath, Tamas
    BIG DATA ANALYTICS AND KNOWLEDGE DISCOVERY, DAWAK 2016, 2016, 9829 : 99 - 113