A general framework for mining massive data streams

Cited by: 89
Authors
Domingos, P [1]
Hulten, G [1]
Affiliations
[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
Keywords
data mining; Hoeffding bounds; machine learning; scalability; subsampling
DOI
10.1198/1061860032544
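Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract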
In many domains, data now arrive faster than we are able to mine them. To avoid wasting these data, we must switch from the traditional "one-shot" data mining approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. In this article we identify some desiderata for such systems, and outline our framework for realizing them. A key property of our approach is that it minimizes the time required to build a model on a stream while guaranteeing (as long as the data are iid) that the model learned is effectively indistinguishable from the one that would be obtained using infinite data. Using this framework, we have successfully adapted several learning algorithms to massive data streams, including decision tree induction, Bayesian network learning, k-means clustering, and the EM algorithm for mixtures of Gaussians. These algorithms are able to process on the order of billions of examples per day using off-the-shelf hardware. Building on this, we are currently developing software primitives for scaling arbitrary learning algorithms to massive data streams with minimal effort.
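The keywords name Hoeffding bounds, and the abstract's guarantee rests on knowing how many iid examples are enough before a statistic computed on a sample can be trusted. This record does not reproduce the article's formulas, so the following is only a minimal sketch of the standard one-sided Hoeffding inequality, not the authors' implementation. The function names `hoeffding_bound` and `examples_needed` and the parameter names are hypothetical, chosen for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon for the one-sided Hoeffding inequality.

    For n iid observations of a variable whose values span `value_range`,
    the true mean is, with probability at least 1 - delta, no more than
    epsilon below (or, symmetrically, above) the observed sample mean.
    """
    if n <= 0 or not (0.0 < delta < 1.0) or value_range <= 0:
        raise ValueError("need n > 0, 0 < delta < 1, value_range > 0")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which hoeffding_bound(...) is at most epsilon."""
    if epsilon <= 0 or not (0.0 < delta < 1.0) or value_range <= 0:
        raise ValueError("need epsilon > 0, 0 < delta < 1, value_range > 0")
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the article):
    # a statistic bounded in [0, 1], 95% confidence, tolerance 0.01.
    n = examples_needed(value_range=1.0, delta=0.05, epsilon=0.01)
    print(n, hoeffding_bound(1.0, 0.05, n))
```

The required sample size depends only on the range, the tolerance, and the confidence level, not on the length of the stream, which is why a bound of this kind lets a learner stop reading data once a decision is statistically settled.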
Pages: 945-949
Number of pages: 5