A Method of Automatic Web Information Extraction Based on Page Clustering

Cited by: 0
Authors
Yang, Tianqi [1 ]
Qiu, Taofen [1 ]
Affiliations
[1] Jinan Univ, Dept Comp Sci, Guangzhou, Guangdong, Peoples R China
Keywords
web information extraction; page clustering; wrapper generation;
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
Dynamic websites contain large numbers of pages with high-value data and a highly modular structure. Based on these features, this paper develops an automatic web information extraction system built on page clustering. Starting from DOM-based extraction, the system clusters pages to find groups of highly similar pages, and improves the accuracy of the clustering results with a column similarity measure and a global auto-similarity measure. The extraction template uses optional nodes to modify and adjust itself, which improves the identification of content nodes. Experimental results show that the method automatically locates and extracts the main information of pages and achieves high precision and recall.
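The clustering step mentioned in the abstract can be illustrated with a minimal sketch. This record does not define the paper's column similarity or global auto-similarity measures, so the sketch swaps in a simpler stand-in: a normalized edit distance over each page's sequence of opening tags, combined with greedy threshold clustering. All names here (`tag_sequence`, `similarity`, `cluster_pages`, the `threshold` value) are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collect opening tag names as a crude structural signature of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names in document order."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute x with y
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed stand-in measure: 1 minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def cluster_pages(pages, threshold=0.8):
    """Greedy clustering: a page joins the first cluster whose first member
    (used as the representative) is at least `threshold` similar to it."""
    clusters = []  # list of (representative tag sequence, member indices)
    for idx, html in enumerate(pages):
        seq = tag_sequence(html)
        for rep, members in clusters:
            if similarity(rep, seq) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((seq, [idx]))
    return [members for _, members in clusters]
```

Pages generated from the same server-side template tend to share a tag sequence and so fall into one cluster, which is the property a template-based extractor relies on. The threshold of 0.8 is an arbitrary illustrative value, not a figure reported by the authors.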
Pages: 390-393
Page count: 4
Related papers
5 in total
  • [1] Extracting lists of data records from semi-structured web pages
    Alvarez, Manuel
    Pan, Alberto
    Raposo, Juan
    Bellas, Fernando
    Cacheda, Fidel
    [J]. DATA & KNOWLEDGE ENGINEERING, 2008, 64 (02) : 491 - 509
  • [2] Chang CH, 2006, IEEE T KNOWL DATA EN, V18, P1411, DOI 10.1109/TKDE.2006.152
  • [3] Clustering Web pages based on their structure
    Crescenzi, V
    Merialdo, P
    Missier, P
    [J]. DATA & KNOWLEDGE ENGINEERING, 2005, 54 (03) : 279 - 299
  • [4] Levenshtein V.I., 1966, Soviet Physics Doklady
  • [5] Clean up your Web pages with HP's HTML Tidy
    Raggett, D
    [J]. COMPUTER NETWORKS AND ISDN SYSTEMS, 1998, 30 (1-7) : 730 - 732