Structured data extraction from the web based on partial tree alignment

Authors
Zhai, Yanhong [1]
Liu, Bing [1]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Chicago, IL 60607 USA
Keywords
Web data extraction; wrapper generation; partial tree alignment; Web mining
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper studies the problem of structured data extraction from arbitrary Web pages. The objective of the proposed research is to automatically segment data records in a page, extract data items/fields from these records, and store the extracted data in a database. Existing methods addressing the problem can be classified into three categories. Methods in the first category provide some languages to facilitate the construction of data extraction systems. Methods in the second category use machine learning techniques to learn wrappers (which are data extraction programs) from human labeled examples. Manual labeling is time-consuming and is hard to scale to a large number of sites on the Web. Methods in the third category are based on the idea of automatic pattern discovery. However, multiple pages that conform to a common schema are usually needed as the input. In this paper, we propose a novel and effective technique (called DEPTA) to perform the task of Web data extraction automatically. The method consists of two steps: 1) identifying individual records in a page and 2) aligning and extracting data items from the identified records. For step 1, a method based on visual information and tree matching is used to segment data records. For step 2, a novel partial alignment technique is proposed. This method aligns only those data items in a pair of records that can be aligned with certainty, making no commitment on the rest of the items. Experimental results obtained using a large number of Web pages from diverse domains show that the proposed two-step technique is highly effective.
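The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which tree matching algorithm is used, so the classic Simple Tree Matching recurrence is assumed here; the visual-information part of step 1 is omitted; and the helper names (simple_tree_matching, lcs_pairs, partial_align) are hypothetical. Trees are (tag, children) tuples, and records in step 2 are simplified to flat lists of item labels.

```python
from typing import List, Tuple

Node = Tuple[str, list]  # (tag, list of child Nodes); an assumed toy representation


def simple_tree_matching(a: Node, b: Node) -> int:
    """Number of node pairs in a maximum top-down, order-preserving matching."""
    if a[0] != b[0]:
        return 0  # different root labels: nothing below them may match
    ca, cb = a[1], b[1]
    m = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            m[i][j] = max(
                m[i - 1][j],
                m[i][j - 1],
                m[i - 1][j - 1] + simple_tree_matching(ca[i - 1], cb[j - 1]),
            )
    return m[len(ca)][len(cb)] + 1  # +1 counts the matched roots


def lcs_pairs(a: List[str], b: List[str]) -> List[Tuple[int, int]]:
    """Index pairs (i, j) of one longest common subsequence of a and b."""
    n, k = len(a), len(b)
    dp = [[0] * (k + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(k - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < k:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def partial_align(seed: List[str], record: List[str]) -> Tuple[List[str], List[str]]:
    """Merge record items into seed only where the position is unambiguous.

    Returns (merged_seed, leftover); leftover items are not committed anywhere.
    """
    bounds = [(-1, -1)] + lcs_pairs(seed, record) + [(len(seed), len(record))]
    merged: List[str] = []
    leftover: List[str] = []
    for (i1, j1), (i2, j2) in zip(bounds, bounds[1:]):
        seed_gap = seed[i1 + 1:i2]      # unmatched seed items between two matches
        rec_gap = record[j1 + 1:j2]     # unmatched record items between the same matches
        merged.extend(seed_gap)
        if rec_gap and not seed_gap:
            merged.extend(rec_gap)      # only one possible insertion point
        else:
            leftover.extend(rec_gap)    # ambiguous relative to seed_gap: defer
        if i2 < len(seed):
            merged.append(seed[i2])     # the matched item itself
    return merged, leftover


row_a: Node = ("tr", [("td", []), ("td", [])])
row_b: Node = ("tr", [("td", []), ("td", []), ("td", [])])
match_count = simple_tree_matching(row_a, row_b)
certain = partial_align(["a", "b", "e"], ["b", "c", "d", "e"])
uncertain = partial_align(["a", "b", "e"], ["a", "x", "e"])
```

In the first call to partial_align, items c and d fall between two matched neighbours that are adjacent in the seed, so they are inserted. In the second call, x could sit before or after b, so it is left unaligned, which mirrors the "no commitment on the rest of the items" behaviour the abstract describes.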
Pages: 1614-1628
Number of pages: 15