Structured data extraction from the web based on partial tree alignment

Cited by: 0
Authors
Zhai, Yanhong [1]
Liu, Bing [1]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Chicago, IL 60607 USA
Keywords
Web data extraction; wrapper generation; partial tree alignment; Web mining
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper studies the problem of structured data extraction from arbitrary Web pages. The objective of the proposed research is to automatically segment data records in a page, extract data items/fields from these records, and store the extracted data in a database. Existing methods addressing the problem can be classified into three categories. Methods in the first category provide some languages to facilitate the construction of data extraction systems. Methods in the second category use machine learning techniques to learn wrappers (which are data extraction programs) from human labeled examples. Manual labeling is time-consuming and is hard to scale to a large number of sites on the Web. Methods in the third category are based on the idea of automatic pattern discovery. However, multiple pages that conform to a common schema are usually needed as the input. In this paper, we propose a novel and effective technique (called DEPTA) to perform the task of Web data extraction automatically. The method consists of two steps: 1) identifying individual records in a page and 2) aligning and extracting data items from the identified records. For step 1, a method based on visual information and tree matching is used to segment data records. For step 2, a novel partial alignment technique is proposed. This method aligns only those data items in a pair of records that can be aligned with certainty, making no commitment on the rest of the items. Experimental results obtained using a large number of Web pages from diverse domains show that the proposed two-step technique is highly effective.
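The tree matching mentioned for step 1 can be illustrated with a small sketch. The abstract does not say which matching algorithm is used, so the code below assumes a standard top-down ordered tree matching computed by dynamic programming (often called simple tree matching). The `Node` type, the function names, and the sample records are hypothetical and exist only for illustration; the sketch does not cover the visual information used in step 1 or the partial alignment of step 2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a tag node in a page's tag tree."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Count the node pairs in a maximum top-down, order-preserving matching."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them may be matched
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best score using the first i children of a and the first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 counts the matched roots


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def normalized_similarity(a: Node, b: Node) -> float:
    """Matching score divided by the average size of the two trees."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


# Two hypothetical records: table rows that differ by one optional cell.
record_1 = Node("tr", [Node("td", [Node("img")]),
                       Node("td", [Node("a")]),
                       Node("td", [Node("span")])])
record_2 = Node("tr", [Node("td", [Node("img")]),
                       Node("td", [Node("a")])])

score = simple_tree_matching(record_1, record_2)
similarity = normalized_similarity(record_1, record_2)
```

Under these assumptions, adjacent subtrees whose normalized similarity exceeds a chosen threshold could be treated as candidate data records; the threshold itself is a tuning parameter that the abstract does not give.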
Pages: 1614-1628
Page count: 15
Related papers
50 in total
  • [1] Web Data Extraction Based On Visual Information and Partial Tree Alignment
    Fan, Siwu
    Wang, Xinjun
    Dong, Yongquan
    [J]. 2014 11TH WEB INFORMATION SYSTEM AND APPLICATION CONFERENCE (WISA), 2014, : 18 - 23
  • [2] From One Tree to a Forest: a Unified Solution for Structured Web Data Extraction
    Hao, Qiang
    Cai, Rui
    Pang, Yanwei
    Zhang, Lei
    [J]. PROCEEDINGS OF THE 34TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR'11), 2011, : 775 - 784
  • [3] Information extraction from Web pages using semi-structured data alignment
    Kuboyama, Tetsuji
    Miyahara, Tetsuhiro
    Hirokawa, Sachio
    Itou, Eisuke
    [J]. WMSCI 2005: 9th World Multi-Conference on Systemics, Cybernetics and Informatics, Vol 1, 2005, : 42 - 47
  • [4] An Efficient Mechanism for Deep Web Data Extraction Based on Tree-Structured Web Pattern Matching
    Ahamed, B. Bazeer
    Yuvaraj, D.
    Shitharth, S.
    Mirza, Olfat M.
    Alsobhi, Aisha
    Yafoz, Ayman
    [J]. WIRELESS COMMUNICATIONS & MOBILE COMPUTING, 2022, 2022
  • [5] A DOM Tree Alignment Model for Mining Parallel Data from the Web
    Shi, Lei
    Niu, Cheng
    Zhou, Ming
    Gao, Jianfeng
    [J]. COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, 2006, : 489 - 496
  • [6] Web Service for Data Extraction from Semi-structured Data Sources
    Yashina, Marina V.
    Nakonechnyy, Ivan I.
    [J]. PROCEEDINGS OF THE NINTH INTERNATIONAL CONFERENCE ON DEPENDABILITY AND COMPLEX SYSTEMS DEPCOS-RELCOMEX, 2014, 286 : 499 - 510
  • [7] Web-Scale Extraction of Structured Data
    Cafarella, Michael J.
    Madhavan, Jayant
    Halevy, Alon
    [J]. SIGMOD RECORD, 2008, 37 (04) : 55 - 61
  • [8] Data extraction from semi-structured web pages by clustering
    Vuong, Le Phong Bao
    Gao, Xiaoying
    Zhang, Mengjie
    [J]. 2006 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE, (WI 2006 MAIN CONFERENCE PROCEEDINGS), 2006, : 374 - +
  • [9] Dependency Tree based Chinese Relation Extraction over Web Data
    Zheng, Shanshan
    Yang, Jing
    Lin, Xin
    Gu, JunZhong
    [J]. 2012 SEVENTH INTERNATIONAL CONFERENCE ON KNOWLEDGE, INFORMATION AND CREATIVITY SUPPORT SYSTEMS (KICSS 2012), 2012, : 104 - 110
  • [10] DEPTA: An Efficient Technique For Web Data Extraction and Alignment
    Lokhande, Rahul L.
    Manjaramkar, Arati
    [J]. 2016 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2016, : 2307 - 2310