Information extraction from HTML: Application of a general machine learning approach

Cited by: 0
Author
Freitag, D [1]
Affiliation
[1] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Because the World Wide Web consists primarily of text, information extraction is central to any effort that would use the Web as a resource for knowledge discovery. We show how information extraction can be cast as a standard machine learning problem, and argue for the suitability of relational learning in solving it. The implementation of a general-purpose relational learner for information extraction, SRV, is described. In contrast with earlier learning systems for information extraction, SRV makes no assumptions about document structure and the kinds of information available for use in learning extraction patterns. Instead, structural and other information is supplied as input in the form of an extensible token-oriented feature set. We demonstrate the effectiveness of this approach by adapting SRV for use in learning extraction rules for a domain consisting of university course and research project pages sampled from the Web. Making SRV Web-ready only involves adding several simple HTML-specific features to its basic feature set.
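As a rough illustration of the token-oriented feature set mentioned in the abstract, the sketch below tags each text token of a page with a few basic features and a few HTML-specific ones. It is not SRV's implementation: the class `TokenFeatureParser`, the function `tokenize_html`, and the feature names `capitalized`, `numeric`, `in_title`, and `in_bold` are all hypothetical, chosen only to show how HTML structure could be supplied to a learner as additional per-token features.

```python
import re
from html.parser import HTMLParser


class TokenFeatureParser(HTMLParser):
    """Collects text tokens and records which HTML tags enclose each one."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.tokens = []     # list of (token, feature dict) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recent matching open tag, tolerating sloppy markup.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        for tok in re.findall(r"\w+", data):
            self.tokens.append((tok, {
                # Basic, domain-independent token features (hypothetical).
                "capitalized": tok[0].isupper(),
                "numeric": tok.isdigit(),
                # HTML-specific features added on top (hypothetical).
                "in_title": "title" in self.open_tags,
                "in_bold": "b" in self.open_tags,
            }))


def tokenize_html(html):
    """Return (token, features) pairs for every word-like token in the page."""
    parser = TokenFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


page = "<html><title>CS 101</title><body><b>Instructor</b>: Ada</body></html>"
for token, features in tokenize_html(page):
    print(token, features)
```

In this sketch, making the feature set "Web-ready" amounts to adding the two tag-based entries to the feature dictionary; the basic features and the tokenizer are left unchanged.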
Pages: 517-523
Number of pages: 7
Related papers
50 in total
  • [1] A General Learning Method for Automatic Title Extraction from HTML Pages
    Changuel, Sahar
    Labroche, Nicolas
    Bouchon-Meunier, Bernadette
    [J]. MACHINE LEARNING AND DATA MINING IN PATTERN RECOGNITION, 2009, 5632 : 704 - 718
  • [2] Rule learning for feature values extraction from HTML product information sheets
    Badica, C
    Badica, A
    [J]. RULES AND RULE MARKUP LANGUAGES FOR THE SEMANTIC WEB, PROCEEDINGS, 2004, 3323 : 37 - 48
  • [3] Multimedia information extraction from HTML product catalogues
    Labsky, Martin
    Praks, Pavel
    Svatek, Vojtech
    Svab, Ondrej
    [J]. DATESO 2005 - DATABASES, TEXTS, SPECIFICATIONS, OBJECTS, 2005, : 84 - 93
  • [4] Layout based information extraction from HTML documents
    Burget, Radek
    [J]. ICDAR 2007: NINTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2007, : 624 - 628
  • [5] Information extraction from HTML pages and its integration
    Itai, K
    Takasu, A
    Adachi, J
    [J]. 2003 SYMPOSIUM ON APPLICATIONS AND THE INTERNET WORKSHOPS, PROCEEDINGS, 2003, : 276 - 281
  • [6] Automatic machine learning of keyphrase extraction from short HTML documents written in Hebrew
    Hacohen-Kerner, Yaakov
    Stern, Ittay
    Korkus, David
    Fredj, Erick
    [J]. CYBERNETICS AND SYSTEMS, 2007, 38 (01) : 1 - 21
  • [7] Extraction and integration information in HTML tables
    Li, SJ
    Peng, ZY
    Liu, MC
    [J]. FOURTH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY, PROCEEDINGS, 2004, : 315 - 320
  • [8] Information extraction from HTML tables base on domain ontology
    Hsiao, SL
    Chou, SC
    Chang, LP
    [J]. IKE'03: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE ENGINEERING, VOLS 1 AND 2, 2003, : 70 - 76
  • [9] An Event Data Extraction Method Based on HTML Structure Analysis and Machine Learning
    Liao, Chenyi
    Hiroi, Kei
    Kaji, Katsuhiko
    Kawaguchi, Nobuo
    [J]. IEEE 39TH ANNUAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE WORKSHOPS (COMPSAC 2015), VOL 3, 2015, : 217 - 222
  • [10] Application of logic wrappers to hierarchical data extraction from HTML
    Badica, Amelia
    Badica, Costin
    Popescu, Elvira
    [J]. PROGRESS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2007, 4874 : 43 - +