Sampling, information extraction and summarisation of Hidden Web databases

Cited by: 14
Authors
Hedley, Yih-Ling [1]
Younas, Muhammad
James, Anne
Sanderson, Mark
Affiliations
[1] Coventry Univ, Sch Math & Informat Sci, Coventry CV1 5FB, W Midlands, England
[2] Oxford Brookes Univ, Dept Comp, Oxford OX33 1HP, England
[3] Univ Sheffield, Dept Informat Studies, Sheffield S1 4DP, S Yorkshire, England
Keywords
Hidden Web databases; information extraction; document sampling
DOI
10.1016/j.datak.2006.01.009
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Hidden Web databases maintain a collection of specialised documents, which are dynamically generated using page templates. This paper presents the Two-Phase Sampling (2PS) technique that detects and extracts query-related information from documents contained in databases. 2PS is based on a two-phase framework for the sampling, information extraction and summarisation of Hidden Web documents. In the first phase, 2PS samples and stores documents for further analysis. In the second phase, it detects Web page templates from sampled documents and extracts relevant information from which a content summary is then generated. Experimental results demonstrate that 2PS effectively eliminates irrelevant information from sampled documents and generates terms and frequencies with improved accuracy. (c) 2006 Published by Elsevier B.V.
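The two phases described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' 2PS algorithm: the abstract does not specify how templates are detected, so the sketch assumes, purely for illustration, that a template is any line shared by most sampled documents. The function names, the `min_share` threshold and the sample documents are all hypothetical.

```python
from collections import Counter
import re

def detect_template_lines(docs, min_share=0.8):
    """Assumed heuristic: a line is template if it occurs in >= min_share of docs."""
    line_doc_counts = Counter()
    for doc in docs:
        # Count each distinct non-empty line once per document.
        line_doc_counts.update(
            {line.strip() for line in doc.splitlines() if line.strip()}
        )
    threshold = min_share * len(docs)
    return {line for line, n in line_doc_counts.items() if n >= threshold}

def content_summary(docs, min_share=0.8):
    """Drop template lines, then count term frequencies over what remains."""
    template = detect_template_lines(docs, min_share)
    term_freq = Counter()
    for doc in docs:
        for line in doc.splitlines():
            line = line.strip()
            if line and line not in template:
                term_freq.update(re.findall(r"[a-z0-9]+", line.lower()))
    return term_freq

# Phase 1 stand-in: documents already sampled and stored (hypothetical data).
sampled = [
    "<html><body>\nSearch results\nAspirin reduces fever\n</body></html>",
    "<html><body>\nSearch results\nIbuprofen reduces inflammation\n</body></html>",
    "<html><body>\nSearch results\nParacetamol relieves pain\n</body></html>",
]

# Phase 2: template removal and summary generation.
summary = content_summary(sampled)
```

In this sketch the shared markup and the "Search results" banner are treated as template and excluded, so only terms from the document-specific lines contribute to the summary.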
Pages: 213-230
Number of pages: 18