Structured data extraction from the web based on partial tree alignment

Authors
Zhai, Yanhong [1]
Liu, Bing [1]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Chicago, IL 60607 USA
Keywords
Web data extraction; wrapper generation; partial tree alignment; Web mining
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper studies the problem of structured data extraction from arbitrary Web pages. The objective of the proposed research is to automatically segment data records in a page, extract data items/fields from these records, and store the extracted data in a database. Existing methods addressing the problem can be classified into three categories. Methods in the first category provide some languages to facilitate the construction of data extraction systems. Methods in the second category use machine learning techniques to learn wrappers (which are data extraction programs) from human labeled examples. Manual labeling is time-consuming and is hard to scale to a large number of sites on the Web. Methods in the third category are based on the idea of automatic pattern discovery. However, multiple pages that conform to a common schema are usually needed as the input. In this paper, we propose a novel and effective technique (called DEPTA) to perform the task of Web data extraction automatically. The method consists of two steps: 1) identifying individual records in a page and 2) aligning and extracting data items from the identified records. For step 1, a method based on visual information and tree matching is used to segment data records. For step 2, a novel partial alignment technique is proposed. This method aligns only those data items in a pair of records that can be aligned with certainty, making no commitment on the rest of the items. Experimental results obtained using a large number of Web pages from diverse domains show that the proposed two-step technique is highly effective.
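The abstract says step 1 relies on tree matching to segment data records but does not spell out the matching procedure. The following is a minimal sketch, assuming a simple order-preserving tree matching computed by dynamic programming over tag trees. The `Node` type, the function names, and the normalization by average tree size are illustrative assumptions, not details taken from the abstract, and the visual-information part of step 1 is not modeled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a node of an HTML tag tree."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of a maximum matching between trees a and b.

    Assumption: roots must carry the same tag to match, and the
    matching preserves the left-to-right order of children.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j]: best matching between the first i children of a
    # and the first j children of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = best[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            best[i][j] = max(best[i - 1][j], best[i][j - 1], paired)
    return best[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Matching size normalized by the average size of the two trees."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


# Two toy "records": table rows that share two of three cells.
record_a = Node("tr", [
    Node("td", [Node("img")]),
    Node("td", [Node("a")]),
    Node("td", [Node("span")]),
])
record_b = Node("tr", [
    Node("td", [Node("img")]),
    Node("td", [Node("a")]),
])

matched = simple_tree_matching(record_a, record_b)  # 5 matched nodes
score = similarity(record_a, record_b)              # 10 / 12
```

In a sketch like this, adjacent subtrees whose similarity exceeds a chosen threshold would be treated as candidate data records; the threshold itself is a tuning parameter that the abstract does not specify.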
Pages: 1614-1628
Page count: 15