Hierarchical Wrapper Induction for Semistructured Information Sources

Authors
Ion Muslea
Steven Minton
Craig A. Knoblock
Affiliations
[1] University of Southern California, Information Sciences Institute and Integrated Media Systems Center
[2] University of Southern California, Information Sciences Institute and Integrated Media Systems Center
[3] University of Southern California, Information Sciences Institute and Integrated Media Systems Center
Keywords
wrapper induction; information extraction; supervised inductive learning; information agents
Abstract
With the tremendous amount of information that becomes available on the Web on a daily basis, the ability to quickly develop information agents has become a crucial problem. A vital component of any Web-based information agent is a set of wrappers that can extract the relevant data from semistructured information sources. Our novel approach to wrapper induction is based on the idea of hierarchical information extraction, which turns the hard problem of extracting data from an arbitrarily complex document into a series of simpler extraction tasks. We introduce an inductive algorithm, STALKER, that generates high accuracy extraction rules based on user-labeled training examples. Labeling the training data represents the major bottleneck in using wrapper induction techniques, and our experimental results show that STALKER requires up to two orders of magnitude fewer examples than other algorithms. Furthermore, STALKER can wrap information sources that could not be wrapped by existing inductive techniques.
Pages: 93-114
Number of pages: 21
Published in: Autonomous Agents and Multi-Agent Systems, 2001, 4(1-2): 93-114