A general framework for mining massive data streams

Cited by: 89

Authors:

Domingos, P [1]

Hulten, G [1]

Affiliations:

[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA

Source:

JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS | 2003 / Vol. 12 / Issue 4

Keywords:

data mining; Hoeffding bounds; machine learning; scalability; subsampling

DOI:

10.1198/1061860032544

Chinese Library Classification (CLC) codes:

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]

Discipline classification codes:

020208; 070103; 0714

Abstract:

In many domains, data now arrive faster than we are able to mine them. To avoid wasting these data, we must switch from the traditional "one-shot" data mining approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. In this article we identify some desiderata for such systems, and outline our framework for realizing them. A key property of our approach is that it minimizes the time required to build a model on a stream while guaranteeing (as long as the data are iid) that the model learned is effectively indistinguishable from the one that would be obtained using infinite data. Using this framework, we have successfully adapted several learning algorithms to massive data streams, including decision tree induction, Bayesian network learning, k-means clustering, and the EM algorithm for mixtures of Gaussians. These algorithms are able to process on the order of billions of examples per day using off-the-shelf hardware. Building on this, we are currently developing software primitives for scaling arbitrary learning algorithms to massive data streams with minimal effort.

Pages: 945-949

Page count: 5

50 items in total

[1] An Efficient Framework of Data Mining and its Analytics on Massive Streams of Big Data Repositories
Disha, D. N.
Sowmya, B. J.
Chetan
Seema, S.
PROCEEDINGS OF 2016 IEEE INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING, VLSI, ELECTRICAL CIRCUITS AND ROBOTICS (DISCOVER), 2016, : 195 - 200
[2] A general framework for mining concept-drifting data streams with evolvable features
Peng, Jiaqi
Guo, Jinxia
Yang, Qinli
Lu, Jianyun
Shao, Junming
2021 21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2021), 2021, : 1276 - 1281
[3] A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions
Gao, Jing
Fan, Wei
Han, Jiawei
Yu, Philip S.
PROCEEDINGS OF THE SEVENTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2007, : 3 - +
[4] A Framework for Classification and Segmentation of Massive Audio Data Streams
Aggarwal, Charu C.
KDD-2007 PROCEEDINGS OF THE THIRTEENTH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2007, : 1013 - 1017
[5] A Framework for Clustering Massive Text and Categorical Data Streams
Aggarwal, Charu C.
Yu, Philip S.
PROCEEDINGS OF THE SIXTH SIAM INTERNATIONAL CONFERENCE ON DATA MINING, 2006, : 479 - 483
[6] A Framework for Clustering Massive-Domain Data Streams
Aggarwal, Charu C.
ICDE: 2009 IEEE 25TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2009, : 102 - 113
[7] Cost-sensitive incremental Classification under the MapReduce framework for Mining Imbalanced Massive Data Streams
Huang Yuwen
JOURNAL OF DISCRETE MATHEMATICAL SCIENCES & CRYPTOGRAPHY, 2015, 18 (1-2): : 177 - 194
[8] Towards a general framework for data mining
Dzeroski, Saso
KNOWLEDGE DISCOVERY IN INDUCTIVE DATABASES, 2007, 4747 : 259 - 300
[9] A general framework on temporal data mining
Pan, Ding
Pan, Yan
PROCEEDINGS OF 2006 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2006, : 1019 - +
[10] An ensemble classifier framework for mining noisy data streams
Ouyang, Zhenzheng
Zhao, Zipeng
Li, Mingjun
Luo, Jianshu
Journal of Computational Information Systems, 2010, 6 (03): : 671 - 678