Interactive tuples extraction from semi-structured data

Cited by: 1
Authors
Gilleron, Remi
Marty, Patrick
Tommasi, Marc
Torre, Fabien
DOI
10.1109/WI.2006.102
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper studies, from a machine learning viewpoint, the problem of extracting tuples of a target n-ary relation from tree-structured data such as XML or XHTML documents. Our system can extract, without any post-processing, tuples for all data structures, including nested, rotated and cross tables. The wrapper induction algorithm we propose is based on two main ideas. It is incremental: partial tuples are extracted by increasing length. It is based on a representation-enrichment procedure: partial tuples of length i are encoded with the knowledge of extracted tuples of length i - 1. The algorithm is then embedded in a user-friendly interactive wrapper induction system for Web documents. We evaluate our system on several information extraction tasks over corporate Web sites. It achieves state-of-the-art results on simple data structures and succeeds on complex data structures where previous approaches fail. Experiments also show that our interactive framework significantly reduces the number of user interactions needed to build a wrapper.
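The incremental idea in the abstract can be illustrated with a toy sketch. It is not the authors' algorithm: the paper induces its classifiers from user annotations, whereas the predicates `is_name` and `is_price_for` below are hand-written stand-ins, and the sample document, the `class` attributes and all function names are hypothetical. The sketch only shows how a partial tuple of length i - 1 can inform the choice of component i, here by requiring the new node to share a parent with the previous one.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the markup is an assumption for illustration.
DOC = """
<table>
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
</table>
"""

def parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent links)."""
    return {child: parent for parent in root.iter() for child in parent}

def extend(partials, root, parents, accept):
    """Grow each partial tuple of length i - 1 into tuples of length i."""
    extended = []
    for partial in partials:
        for node in root.iter():
            if accept(node, partial, parents):
                extended.append(partial + (node,))
    return extended

def is_name(node, partial, parents):
    # Stand-in for a learned classifier for the first component.
    return node.get("class") == "name"

def is_price_for(node, partial, parents):
    # Stand-in for a learned classifier for the second component. The
    # candidate is judged relative to the previously extracted component:
    # it must sit under the same parent element.
    return (node.get("class") == "price"
            and parents[node] is parents[partial[-1]])

def extract(xml_text, classifiers):
    """Extract tuples component by component, by increasing length."""
    root = ET.fromstring(xml_text)
    parents = parent_map(root)
    partials = [()]  # one empty partial tuple to start from
    for accept in classifiers:
        partials = extend(partials, root, parents, accept)
    return [tuple(node.text for node in t) for t in partials]

tuples = extract(DOC, [is_name, is_price_for])
print(tuples)
```

Because each step sees the partial tuple built so far, the second predicate pairs each name only with the price in its own row rather than with every price in the document.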
Pages: 997 - 1004
Number of pages: 8