Information extraction from structured documents using k-testable tree automaton inference

Cited by: 15
Authors
Kosala, Raymond
Blockeel, Hendrik
Bruynooghe, Maurice
Van den Bussche, Jan
Affiliations
[1] Katholieke Univ Leuven, Dept Comp Sci, B-3001 Heverlee, Belgium
[2] Limburgs Univ Ctr, Dept WNI, B-3590 Diepenbeek, Belgium
Keywords
information extraction; wrapper induction; tree automata; machine learning
DOI
10.1016/j.datak.2005.05.002
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Information extraction (IE) addresses the problem of extracting specific information from a collection of documents. Much of the previous work on IE from structured documents, such as HTML or XML, uses learning techniques that are based on strings, such as finite automata induction. These methods do not exploit the tree structure of the documents. A natural way to do this is to induce tree automata, which are like finite state automata but parse trees instead of strings. In this work, we explore induction of k-testable ranked tree automata from a small set of annotated examples. We describe three variants which differ in the way they generalize the inferred automaton. Experimental results on a set of benchmark data sets show that our approach compares favorably to string-based approaches. However, the quality of the extraction is still suboptimal. (c) 2005 Elsevier B.V. All rights reserved.
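To make the idea of k-testable tree languages concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the standard characterization of a k-testable tree language by three sets collected from the examples: the top k-1 levels of each tree (roots), the depth-k pieces around every sufficiently deep node (forks), and all subtrees of depth at most k-2 (short subtrees). A tree is accepted when all of its own pieces were seen in training. The names `top`, `signature`, `learn`, and `accepts` and the nested-tuple tree encoding are illustrative choices, and the sketch covers only acceptance, not the marking of extraction targets or the generalization variants the abstract mentions.

```python
from typing import Iterable, Iterator, NamedTuple

# Assumed encoding: a tree is (label, (child, child, ...)); a leaf is (label, ()).
Tree = tuple


class Model(NamedTuple):
    roots: frozenset   # top k-1 levels of each example tree
    forks: frozenset   # top k levels of every subtree with depth >= k-1
    shorts: frozenset  # every subtree with depth <= k-2


def depth(t: Tree) -> int:
    """Depth of a tree; a leaf has depth 0."""
    return 0 if not t[1] else 1 + max(depth(c) for c in t[1])


def top(t: Tree, levels: int) -> Tree:
    """Keep only the top `levels` levels of t (the root alone is one level)."""
    label, kids = t
    if levels <= 1:
        return (label, ())
    return (label, tuple(top(c, levels - 1) for c in kids))


def subtrees(t: Tree) -> Iterator[Tree]:
    """Yield t and every subtree below it."""
    yield t
    for c in t[1]:
        yield from subtrees(c)


def signature(t: Tree, k: int) -> Model:
    """Collect the roots, forks, and short subtrees of a single tree."""
    if k < 2:
        raise ValueError("this sketch assumes k >= 2")
    nodes = list(subtrees(t))
    return Model(
        roots=frozenset({top(t, k - 1)}),
        forks=frozenset(top(s, k) for s in nodes if depth(s) >= k - 1),
        shorts=frozenset(s for s in nodes if depth(s) <= k - 2),
    )


def learn(examples: Iterable[Tree], k: int) -> Model:
    """Union the signatures of all example trees."""
    sigs = [signature(t, k) for t in examples]
    return Model(
        roots=frozenset().union(*(s.roots for s in sigs)),
        forks=frozenset().union(*(s.forks for s in sigs)),
        shorts=frozenset().union(*(s.shorts for s in sigs)),
    )


def accepts(model: Model, t: Tree, k: int) -> bool:
    """Accept t if all of its pieces were observed in the examples."""
    sig = signature(t, k)
    return (sig.roots <= model.roots
            and sig.forks <= model.forks
            and sig.shorts <= model.shorts)


# Hypothetical HTML-like trees, invented for illustration only.
seen = ("ul", (("li", (("b", ()),)), ("li", (("i", ()),))))
unseen = ("ul", (("li", (("i", ()),)), ("li", (("i", ()),))))
other = ("ul", (("li", (("u", ()),)), ("li", (("i", ()),))))

model2 = learn([seen], k=2)
model3 = learn([seen], k=3)
```

With these toy trees, `accepts(model2, unseen, 2)` holds because every fork of `unseen` already occurs in `seen`, while `accepts(model3, unseen, 3)` does not, since larger k keeps more context and generalizes less. This is the trade-off that the choice of k controls.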
Pages: 129-158
Number of pages: 30