Sampling, information extraction and summarisation of Hidden Web databases

Cited by: 14

Authors:

Hedley, Yih-Ling [1]

Younas, Muhammad

James, Anne

Sanderson, Mark

Affiliations:

[1] Coventry Univ, Sch Math & Informat Sci, Coventry CV1 5FB, W Midlands, England

[2] Oxford Brookes Univ, Dept Comp, Oxford OX33 1HP, England

[3] Univ Sheffield, Dept Informat Studies, Sheffield S1 4DP, S Yorkshire, England

Source:

DATA & KNOWLEDGE ENGINEERING | 2006 / Vol. 59 / No. 2

Keywords:

Hidden Web databases; information extraction; document sampling;

DOI:

10.1016/j.datak.2006.01.009

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Hidden Web databases maintain a collection of specialised documents, which are dynamically generated using page templates. This paper presents the Two-Phase Sampling (2PS) technique that detects and extracts query-related information from documents contained in databases. 2PS is based on a two-phase framework for the sampling, information extraction and summarisation of Hidden Web documents. In the first phase, 2PS samples and stores documents for further analysis. In the second phase, it detects Web page templates from sampled documents and extracts relevant information from which a content summary is then generated. Experimental results demonstrate that 2PS effectively eliminates irrelevant information from sampled documents and generates terms and frequencies with improved accuracy. (c) 2006 Published by Elsevier B.V.
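The second phase described in the abstract (detecting template content in sampled documents, then building a term-frequency summary from what remains) can be illustrated with a toy sketch. This is not the 2PS algorithm from the paper: it assumes, purely for illustration, that any line repeated across all sampled documents is template text. The function names `detect_template_lines` and `summarise`, the `min_share` parameter, and the sample pages are all hypothetical.

```python
from collections import Counter
import re


def detect_template_lines(documents, min_share=1.0):
    """Return lines found in at least `min_share` of the sampled documents.

    Assumption for illustration only: repeated lines are template text.
    """
    line_counts = Counter()
    for doc in documents:
        # Count each distinct non-empty line once per document.
        line_counts.update(
            {line.strip() for line in doc.splitlines() if line.strip()}
        )
    threshold = min_share * len(documents)
    return {line for line, n in line_counts.items() if n >= threshold}


def summarise(documents, min_share=1.0):
    """Build a term-frequency summary after dropping template lines."""
    template = detect_template_lines(documents, min_share)
    term_freq = Counter()
    for doc in documents:
        for line in doc.splitlines():
            line = line.strip()
            if line and line not in template:
                # Lower-case alphabetic tokens only; a deliberately crude tokenizer.
                term_freq.update(re.findall(r"[a-z]+", line.lower()))
    return term_freq


# Hypothetical sampled pages that share a navigation bar and a footer.
sampled = [
    "Home | Search | Help\nAspirin relieves mild pain\nCopyright Example Health",
    "Home | Search | Help\nInsulin regulates blood glucose\nCopyright Example Health",
    "Home | Search | Help\nAspirin may thin the blood\nCopyright Example Health",
]
summary = summarise(sampled)
```

In this sketch the navigation and footer lines are excluded before counting, so terms such as "home" or "copyright" do not inflate the summary, which is the kind of effect the abstract attributes to removing irrelevant information.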

Pages: 213-230

Number of pages: 18
