A general framework for mining massive data streams

Times cited: 89
Authors
Domingos, P [1]
Hulten, G [1]
Affiliations
[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
Keywords
data mining; Hoeffding bounds; machine learning; scalability; subsampling
DOI
10.1198/1061860032544
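Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714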
Abstract
In many domains, data now arrive faster than we are able to mine them. To avoid wasting these data, we must switch from the traditional "one-shot" data mining approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. In this article we identify some desiderata for such systems, and outline our framework for realizing them. A key property of our approach is that it minimizes the time required to build a model on a stream while guaranteeing (as long as the data are iid) that the model learned is effectively indistinguishable from the one that would be obtained using infinite data. Using this framework, we have successfully adapted several learning algorithms to massive data streams, including decision tree induction, Bayesian network learning, k-means clustering, and the EM algorithm for mixtures of Gaussians. These algorithms are able to process on the order of billions of examples per day using off-the-shelf hardware. Building on this, we are currently developing software primitives for scaling arbitrary learning algorithms to massive data streams with minimal effort.
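The abstract does not state the bound behind this guarantee, but the keywords name Hoeffding bounds. As a rough illustration only, the sketch below uses the standard Hoeffding inequality to estimate how many iid examples are needed before an observed sample mean is within a tolerance of the true mean. The function names and the parameters `value_range`, `delta`, and `epsilon` are hypothetical and are not taken from the article; this is not the authors' framework.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Hoeffding bound: with probability at least 1 - delta, the mean of n
    iid observations (each spanning at most value_range) does not exceed
    the true mean by more than the returned epsilon."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding bound is at most epsilon."""
    return math.ceil(
        value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2)
    )


def can_stop(best_mean, second_mean, value_range, delta, n):
    """Assumed stopping rule (illustrative): commit to the best candidate
    once its observed lead over the runner-up exceeds the bound, so that
    more data would be unlikely to change the choice."""
    return (best_mean - second_mean) > hoeffding_epsilon(value_range, delta, n)


if __name__ == "__main__":
    # Hypothetical numbers: scores in [0, 1], failure probability 1e-7.
    print(hoeffding_epsilon(1.0, 1e-7, 1000))
    print(examples_needed(1.0, 1e-7, 0.05))
    print(can_stop(0.30, 0.15, 1.0, 1e-7, 1000))
```

Because the bound shrinks as 1/sqrt(n), a learner using a rule like this can stop reading the stream at each decision as soon as the evidence is sufficient, which is one way to read the abstract's claim about minimizing the time needed to build a model.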
Pages: 945-949
Page count: 5