Stratified sampling for data mining on the deep web

Cited by: 0
Authors
Tantan Liu
Fan Wang
Gagan Agrawal
Affiliation
[1] The Ohio State University, Department of Computer Science and Engineering
Keywords
deep web; association rule mining; stratified sampling
DOI
Not available
Abstract
In recent years, the deep web has become extremely popular. As with any other data source, mining the deep web can yield important insights or summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly, so mining must be performed on samples of the datasets. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems, association mining and differential rule mining. These are proposed to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs compared with a stratified sampling method that considers only estimation error. In addition, compared with simple random sampling, our algorithm has higher sampling accuracy and lower sampling costs.
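To make the idea of balancing estimation error against sampling cost concrete, the following is a minimal sketch of the classical textbook cost-aware allocation for stratified sampling, in which each stratum's share of the samples is proportional to N_h * S_h / sqrt(c_h). It is not the allocation method or the greedy stratification developed in the paper; the function names, the linear per-sample cost model, and the example numbers are all assumptions chosen for illustration.

```python
import math


def cost_aware_allocation(sizes, std_devs, costs, total_samples):
    """Split total_samples across strata in proportion to N_h * S_h / sqrt(c_h).

    sizes     -- stratum population sizes N_h
    std_devs  -- within-stratum standard deviations S_h
    costs     -- assumed cost of drawing one sample from each stratum, c_h
    Returns fractional allocations; rounding is left to the caller.
    """
    weights = [n * s / math.sqrt(c) for n, s, c in zip(sizes, std_devs, costs)]
    total_weight = sum(weights)
    return [total_samples * w / total_weight for w in weights]


def stratified_mean(sizes, sample_means):
    """Combine per-stratum sample means into a population-weighted estimate."""
    population = sum(sizes)
    return sum(n * m for n, m in zip(sizes, sample_means)) / population


# Hypothetical example: two equally sized, equally variable strata,
# where querying the second stratum is four times as expensive.
sizes = [1000, 1000]
std_devs = [2.0, 2.0]
costs = [1.0, 4.0]
allocation = cost_aware_allocation(sizes, std_devs, costs, total_samples=90)
estimate = stratified_mean([1000, 3000], [1.0, 2.0])
```

In this hypothetical setting the cheaper stratum receives twice as many samples as the expensive one, which illustrates how a cost term can shift an allocation that would otherwise be even.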
Pages: 179 - 196
Page count: 17
Related papers
50 in total
  • [1] Stratified sampling for data mining on the deep web
    Liu, Tantan
    Wang, Fan
    Agrawal, Gagan
    FRONTIERS OF COMPUTER SCIENCE, 2012, 6 (02) : 179 - 196
  • [2] Stratified Sampling Design Based on Data Mining
    Kim, Yeonkook J.
    Oh, Yoonhwan
    Park, Sunghoon
    Cho, Sungzoon
    Park, Hayoung
    HEALTHCARE INFORMATICS RESEARCH, 2013, 19 (03) : 186 - 195
  • [3] Stratified sampling for association rules mining
    Li, YR
    Gopalan, RP
    ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS II, 2005, 187 : 79 - 88
  • [4] A novel building sampling approach leveraging data mining and stratified sampling theory for energy optimization
    Fang, Zhijian
    Lei, Lei
    Zheng, Run
    ENERGY AND BUILDINGS, 2025, 330
  • [5] Web + Data Mining = Web Mining
    Stoffel, Kilian
    HMD Praxis der Wirtschaftsinformatik, 2009, 46 (4) : 6 - 20
  • [6] Euclidean distance stratified random sampling based clustering model for big data mining
    Pandey, Kamlesh Kumar
    Shukla, Diwakar
    COMPUTATIONAL AND MATHEMATICAL METHODS, 2021, 3 (06)
  • [7] Deep web content mining
    Ajoudanian, Shohreh
    Jazi, Mohammad Davarpanah
    World Academy of Science, Engineering and Technology, 2009, 37 : 501 - 505
  • [8] Deep sampling and testing in soft stratified clay
    Cummings, SJ
    Sivakumar, V
    Doran, IG
    Graham, J
    CANADIAN GEOTECHNICAL JOURNAL, 2003, 40 (03) : 575 - 586
  • [9] Web data mining
    Wibonele, KJ
    Zhang, YQ
    DATA MINING AND KNOWLEDGE DISCOVERY: THEORY, TOOLS AND TECHNOLOGY IV, 2002, 4730 : 241 - 244
  • [10] Data mining for the web
    Spiliopoulou, M
    PRINCIPLES OF DATA MINING AND KNOWLEDGE DISCOVERY, 1999, 1704 : 588 - 589