A Method of Automatic Web Information Extraction Based on Page Clustering

Cited by: 0

Authors:

Yang, Tianqi [1]

Qiu, Taofen [1]

Affiliation:

[1] Jinan Univ, Dept Comp Sci, Guangzhou, Guangdong, Peoples R China

Source:

2011 9TH WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION (WCICA 2011) | 2011

Keywords:

web information extraction; page clustering; wrapper generation

DOI:

Not available

Chinese Library Classification:

TP [Automation Technology, Computer Technology]

Subject classification code:

0812

Abstract:

Dynamic websites contain large numbers of pages that carry high-value data and share a highly modular structure. Exploiting these features, this paper develops an automatic web information extraction system based on page clustering. Building on DOM-based extraction, the system clusters pages to find groups of highly similar pages, and it improves the accuracy of the clustering results by using a column similarity measure and a global auto-similarity measure. The extraction template uses optional nodes to modify and adjust itself, which improves the identification of content nodes. Experimental results show that the method automatically locates and extracts the main information of pages and achieves high precision and recall.
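The clustering step described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: the column similarity and global auto-similarity measures are not specified in this record, so the sketch substitutes a generic tag-sequence ratio from Python's standard `difflib`. The names `tag_sequence`, `structural_similarity`, `cluster_pages`, and the `threshold` value of 0.8 are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects the opening-tag names of a page in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    # Reduce a page to its structure by keeping only the tag names.
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def structural_similarity(html_a, html_b):
    # Stand-in measure (an assumption, not the paper's): share of tags that
    # match between the two sequences, between 0.0 and 1.0.
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()


def cluster_pages(pages, threshold=0.8):
    # Greedy single-pass clustering: a page joins the first cluster whose
    # representative (its first member) is similar enough, otherwise it
    # starts a new cluster. Clusters hold indices into `pages`.
    clusters = []
    for index, html in enumerate(pages):
        for cluster in clusters:
            if structural_similarity(pages[cluster[0]], html) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters
```

Pages generated from the same template should land in one cluster, from which a shared extraction template could then be derived; a real system would likely compare DOM trees rather than flat tag sequences.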


Pages: 390-393

Page count: 4

