Stratified sampling for data mining on the deep web

Authors
Tantan Liu
Fan Wang
Gagan Agrawal
Affiliation
[1] The Ohio State University, Department of Computer Science and Engineering
Keywords
deep web; association rule mining; stratified sampling
DOI
Not available
Abstract
In recent years, the deep web has become extremely popular. As with any other data source, mining deep web data can yield important insights and summaries. However, data mining on the deep web is challenging because the underlying databases cannot be accessed directly, so mining must be performed on samples of the data. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems: association mining and differential rule mining. These tasks are intended to extract high-level summaries of the differences between the data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source and considers both the estimation error and the sampling cost. We have also developed an optimized sample allocation method that integrates estimation error and sampling cost. Our experimental results show that our algorithms effectively and consistently reduce sampling costs compared with a stratified sampling method that considers only estimation error. In addition, compared with simple random sampling, our algorithm achieves higher sampling accuracy at lower sampling cost.
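To make the idea of weighing estimation error against sampling cost concrete, the sketch below shows the classical textbook cost-aware allocation for stratified sampling, in which the sample size of stratum h is proportional to N_h * S_h / sqrt(c_h). This is not the optimized allocation method of the paper, whose details the abstract does not give. The function names, the per-stratum cost model, and the example numbers are all assumptions made for illustration.

```python
import math


def cost_aware_allocation(sizes, stddevs, costs, total_samples):
    """Split total_samples across strata with n_h proportional to
    N_h * S_h / sqrt(c_h) (classical optimum allocation with costs).

    sizes   -- stratum population sizes N_h (assumed known or estimated)
    stddevs -- per-stratum standard deviations S_h (e.g. from a pilot sample)
    costs   -- assumed per-sample cost c_h, e.g. queries issued per record
    """
    weights = [n * s / math.sqrt(c) for n, s, c in zip(sizes, stddevs, costs)]
    total = sum(weights)
    return [total_samples * w / total for w in weights]


def stratified_mean(samples_by_stratum, sizes):
    """Stratified estimate of the population mean: sum of W_h * mean_h."""
    population = sum(sizes)
    return sum(
        (size / population) * (sum(sample) / len(sample))
        for sample, size in zip(samples_by_stratum, sizes)
    )


# Two equally large, equally variable strata; the second costs 4x per sample,
# so it receives half as many samples as the first.
allocation = cost_aware_allocation([1000, 1000], [2.0, 2.0], [1.0, 4.0], 90)
print(allocation)  # [60.0, 30.0]
```

In this toy setting a stratum that is four times as expensive to sample gets half the samples, which illustrates how cost enters the allocation alongside variability. An allocation driven by estimation error alone would split the 90 samples evenly.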
Pages: 179-196
Page count: 17