Stratified sampling for data mining on the deep web

Cited by: 0

Authors:

Tantan Liu

Fan Wang

Gagan Agrawal

Affiliation:

[1] The Ohio State University,Department of Computer Science and Engineering

Source:

Frontiers of Computer Science | 2012 / Vol. 6

Keywords:

deep web; association rule mining; stratified sampling

DOI:

Not available

Abstract:

In recent years, the deep web has become extremely popular. As with any other data source, data mining on the deep web can produce important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly; it must therefore be performed on samples of the datasets. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems, association mining and differential rule mining. These are proposed to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs compared with a stratified sampling method that considers only estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.
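The abstract does not spell out the greedy stratification or the optimized allocation method, so the following is only a minimal sketch of the general idea of trading estimation error against sampling cost. It uses the classical cost-weighted allocation from sampling theory, in which the sample size of a stratum is proportional to N_h * S_h / sqrt(c_h); this is a swapped-in textbook technique, not the authors' algorithm. The function names, the per-stratum sizes, standard deviations, and per-sample query costs are all hypothetical and chosen for illustration.

```python
import math


def cost_aware_allocation(sizes, std_devs, costs, total_samples):
    """Split total_samples across strata with n_h proportional to
    N_h * S_h / sqrt(c_h) (textbook allocation, assumed here for illustration).

    sizes      -- stratum population sizes N_h
    std_devs   -- within-stratum standard deviations S_h
    costs      -- cost of drawing one sample from each stratum c_h
    """
    weights = [n * s / math.sqrt(c) for n, s, c in zip(sizes, std_devs, costs)]
    total_weight = sum(weights)
    # Fractional allocations; a real sampler would round these to integers.
    return [total_samples * w / total_weight for w in weights]


def stratified_mean(sizes, stratum_means):
    """Combine per-stratum sample means into a population estimate,
    weighting each stratum by its share of the population."""
    population = sum(sizes)
    return sum(n * m for n, m in zip(sizes, stratum_means)) / population


# Two equally sized, equally variable strata; the second is four times
# as expensive to query, so it receives half as many samples.
allocation = cost_aware_allocation([100, 100], [1.0, 1.0], [1.0, 4.0], 30)
estimate = stratified_mean([100, 300], [2.0, 4.0])
```

Under this rule, a stratum gets more samples when it is larger or more variable and fewer when each query against it is more expensive, which is the kind of error-versus-cost balance the abstract describes.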

Pages: 179-196

Page count: 17
