A general framework for mining massive data streams

Cited by: 89
Authors
Domingos, P [1]
Hulten, G [1]
Affiliations
[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA
Keywords
data mining; Hoeffding bounds; machine learning; scalability; subsampling
DOI
10.1198/1061860032544
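Chinese Library Classification (CLC) number
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract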
In many domains, data now arrive faster than we are able to mine them. To avoid wasting these data, we must switch from the traditional "one-shot" data mining approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. In this article we identify some desiderata for such systems, and outline our framework for realizing them. A key property of our approach is that it minimizes the time required to build a model on a stream while guaranteeing (as long as the data are iid) that the model learned is effectively indistinguishable from the one that would be obtained using infinite data. Using this framework, we have successfully adapted several learning algorithms to massive data streams, including decision tree induction, Bayesian network learning, k-means clustering, and the EM algorithm for mixtures of Gaussians. These algorithms are able to process on the order of billions of examples per day using off-the-shelf hardware. Building on this, we are currently developing software primitives for scaling arbitrary learning algorithms to massive data streams with minimal effort.
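The keywords list Hoeffding bounds, which are a standard way to decide how many iid examples are enough before a decision made on a sample can be trusted. The sketch below only illustrates the textbook one-sided bound; it is not taken from the article, and the function names `hoeffding_epsilon` and `examples_needed` are hypothetical.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """One-sided Hoeffding bound.

    For n iid observations of a quantity whose values span `value_range`,
    the true mean is at least (sample mean - epsilon) with probability
    at least 1 - delta.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which the bound above is no wider than `epsilon`."""
    return math.ceil(
        value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2)
    )


if __name__ == "__main__":
    # Illustrative values only (assumptions, not figures from the article).
    n = examples_needed(value_range=1.0, delta=1e-7, epsilon=0.01)
    print(n, hoeffding_epsilon(1.0, 1e-7, n))
```

The required number of examples depends on the tolerance and the confidence level but not on the length of the stream, which is one way a learner can stop reading early while still bounding the chance of a different decision than the full data would give.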
Pages: 945-949
Page count: 5