A general framework for mining massive data streams

Cited by: 89

Authors:

Domingos, P [1]

Hulten, G [1]

Affiliations:

[1] Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA

Source:

JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS | 2003 / Vol. 12 / No. 04

Keywords:

data mining; Hoeffding bounds; machine learning; scalability; subsampling;

DOI:

10.1198/1061860032544

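Chinese Library Classification (CLC) number:

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];

Discipline classification code: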

020208 ; 070103 ; 0714 ;

Abstract:

In many domains, data now arrive faster than we are able to mine them. To avoid wasting these data, we must switch from the traditional "one-shot" data mining approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. In this article we identify some desiderata for such systems, and outline our framework for realizing them. A key property of our approach is that it minimizes the time required to build a model on a stream while guaranteeing (as long as the data are iid) that the model learned is effectively indistinguishable from the one that would be obtained using infinite data. Using this framework, we have successfully adapted several learning algorithms to massive data streams, including decision tree induction, Bayesian network learning, k-means clustering, and the EM algorithm for mixtures of Gaussians. These algorithms are able to process on the order of billions of examples per day using off-the-shelf hardware. Building on this, we are currently developing software primitives for scaling arbitrary learning algorithms to massive data streams with minimal effort.
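The abstract does not spell out how a model built from a finite prefix of a stream can be made nearly indistinguishable from the infinite-data model, but the keyword list names Hoeffding bounds. As a rough illustration only, and not the authors' framework, the sketch below evaluates the standard one-sided Hoeffding bound for the mean of iid observations with a bounded range. The function names and the parameters `value_range`, `delta`, and `epsilon` are hypothetical and chosen for this example.

```python
import math


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """One-sided Hoeffding bound for the mean of n iid observations.

    With probability at least 1 - delta, the true mean is no more than
    epsilon below the sample mean, where each observation lies in an
    interval of width value_range.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def examples_needed(value_range: float, delta: float, epsilon: float) -> int:
    """Smallest n for which hoeffding_epsilon(value_range, delta, n) <= epsilon."""
    return math.ceil(
        value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2)
    )


if __name__ == "__main__":
    # Illustrative values only: observations in [0, 1], failure probability 1e-6.
    n = examples_needed(value_range=1.0, delta=1e-6, epsilon=0.01)
    print(n, hoeffding_epsilon(1.0, 1e-6, n))
```

The bound depends only on the range, the confidence level, and the number of examples seen, not on the total stream length. This is why a learner can, in principle, stop reading examples for a given decision once the bound is tight enough.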

Citation

Pages: 945 - 949

Page count: 5

