Stratified sampling for data mining on the deep web

Authors
Tantan Liu
Fan Wang
Gagan Agrawal
Affiliation
[1] The Ohio State University, Department of Computer Science and Engineering
Keywords
deep web; association rule mining; stratified sampling
DOI
Not available
Abstract
In recent years, the deep web has become extremely popular. As with any other data source, mining the deep web can yield important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, so mining must be performed on samples of the datasets. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems: association mining and differential rule mining. These are proposed to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs compared with a stratified sampling method that considers only estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.
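As a rough illustration of how estimation error and sampling cost can be traded off when allocating samples across strata, the sketch below uses the classical textbook cost-aware allocation, in which the sample size for stratum h is proportional to N_h * S_h / sqrt(c_h). This is not the allocation method of the paper, whose details are not given in the abstract; the function name, parameters, and numbers are hypothetical.

```python
import math

def cost_aware_allocation(sizes, std_devs, costs, budget):
    """Textbook cost-aware stratified allocation (not the paper's method).

    sizes:    stratum population sizes N_h (assumed known or estimated)
    std_devs: per-stratum standard deviations S_h (assumed estimated from a pilot sample)
    costs:    per-sample cost c_h in each stratum (e.g. queries issued per record)
    budget:   total sampling cost allowed
    """
    # Weight of each stratum: N_h * S_h / sqrt(c_h)
    weights = [n * s / math.sqrt(c) for n, s, c in zip(sizes, std_devs, costs)]
    # Cost of drawing weights[h] samples from every stratum: sum of N_h * S_h * sqrt(c_h)
    unit_cost = sum(w * c for w, c in zip(weights, costs))
    # Scale the weights so the total cost matches the budget; round down and
    # never request more samples than a stratum contains.
    return [min(n, math.floor(budget * w / unit_cost)) for n, w in zip(sizes, weights)]

# Hypothetical example: two strata, the first more variable but costlier to sample.
allocation = cost_aware_allocation(
    sizes=[1000, 4000], std_devs=[2.0, 1.0], costs=[4.0, 1.0], budget=600
)
print(allocation)
```

Strata that are large or highly variable receive more samples, while strata that are expensive to query receive fewer, which is the general intuition behind combining estimation error and sampling cost in one allocation.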
Pages: 179-196
Page count: 17
    PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM '20), 2020, : 865 - 868