Stratified sampling for data mining on the deep web

Authors
Tantan Liu
Fan Wang
Gagan Agrawal
Affiliation
[1] The Ohio State University,Department of Computer Science and Engineering
Keywords
deep web; association rule mining; stratified sampling
DOI
Not available
Abstract
In recent years, the deep web has become extremely popular. As with any other data source, mining the deep web can yield important insights and summaries of results. However, data mining on the deep web is challenging because the databases cannot be accessed directly; mining must therefore be performed on samples of the datasets. The samples, in turn, can only be obtained by querying deep web databases with specific inputs. In this paper, we target two related data mining problems, association mining and differential rule mining. These are proposed to extract high-level summaries of the differences in data provided by different deep web data sources in the same domain. We develop stratified sampling methods to perform these mining tasks on a deep web source. Our contributions include a novel greedy stratification approach, which recursively processes the query space of a deep web data source and considers both the estimation error and the sampling costs. We have also developed an optimized sample allocation method that integrates estimation error and sampling costs. Our experimental results show that our algorithms effectively and consistently reduce sampling costs compared with a stratified sampling method that considers only estimation error. In addition, compared with simple random sampling, our algorithm achieves higher sampling accuracy at lower sampling cost.
Pages: 179-196
Page count: 17
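
Illustrative note: the abstract mentions a sample allocation that weighs estimation error against sampling cost, but this record does not give the paper's actual formulation. As a rough illustration of the general idea only, the sketch below uses the textbook cost-aware allocation from classical stratified sampling, where the sample size of stratum h is proportional to N_h * S_h / sqrt(c_h). The function name, parameter names, and the numbers in the usage lines are all hypothetical and are not taken from the paper.

```python
import math


def cost_aware_allocation(sizes, std_devs, unit_costs, budget):
    """Textbook cost-aware allocation for stratified sampling.

    Assumption: this is the classical rule n_h ~ N_h * S_h / sqrt(c_h),
    NOT the optimized allocation method described in the paper.

    sizes      -- stratum population sizes N_h
    std_devs   -- within-stratum standard deviations S_h
    unit_costs -- cost of drawing one sample from each stratum c_h
    budget     -- total sampling budget to spend across all strata
    """
    # Relative weight of each stratum: large, variable, cheap strata get more samples.
    weights = [n * s / math.sqrt(c)
               for n, s, c in zip(sizes, std_devs, unit_costs)]
    # Choose the scale so that sum(c_h * n_h) equals the budget.
    cost_per_unit_scale = sum(n * s * math.sqrt(c)
                              for n, s, c in zip(sizes, std_devs, unit_costs))
    scale = budget / cost_per_unit_scale
    # Real-valued sample sizes; round in practice.
    return [scale * w for w in weights]


# Hypothetical example: two equally sized strata, the second four times as costly to query.
allocation = cost_aware_allocation(
    sizes=[1000, 1000],
    std_devs=[1.0, 1.0],
    unit_costs=[1.0, 4.0],
    budget=300.0,
)
total_cost = sum(c * n for c, n in zip([1.0, 4.0], allocation))
```

In this hypothetical setup the cheaper stratum receives about twice as many samples as the expensive one, and the total cost matches the budget. On a deep web source the per-stratum cost would plausibly correspond to the number of queries needed to obtain a sample, which is an interpretation rather than something this record states.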