Information extraction from HTML: Application of a general machine learning approach

Cited by: 0
Authors
Freitag, D [1]
Affiliations
[1] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA
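Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes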
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Because the World Wide Web consists primarily of text, information extraction is central to any effort that would use the Web as a resource for knowledge discovery. We show how information extraction can be cast as a standard machine learning problem, and argue for the suitability of relational learning in solving it. The implementation of a general-purpose relational learner for information extraction, SRV, is described. In contrast with earlier learning systems for information extraction, SRV makes no assumptions about document structure and the kinds of information available for use in learning extraction patterns. Instead, structural and other information is supplied as input in the form of an extensible token-oriented feature set. We demonstrate the effectiveness of this approach by adapting SRV for use in learning extraction rules for a domain consisting of university course and research project pages sampled from the Web. Making SRV Web-ready only involves adding several simple HTML-specific features to its basic feature set.
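The idea of an extensible, token-oriented feature set can be illustrated with a minimal Python sketch. This is not SRV's implementation: the class `TokenCollector`, the function `featurize`, and every feature name below are hypothetical, chosen only to show how a few HTML-specific token features might be added to a basic feature set without changing the rest of the pipeline.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collect whitespace-delimited text tokens with their enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.tokens = []      # list of (text, tuple_of_enclosing_tags)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy markup.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        for word in data.split():
            self.tokens.append((word, tuple(self.open_tags)))


# Hypothetical basic features: each maps (token text, enclosing tags) to a boolean.
BASIC_FEATURES = {
    "capitalized": lambda text, tags: text[:1].isupper(),
    "numeric": lambda text, tags: text.isdigit(),
}

# Hypothetical HTML-specific features, added on top of the basic set.
HTML_FEATURES = {
    "in_title": lambda text, tags: "title" in tags,
    "in_h1": lambda text, tags: "h1" in tags,
    "in_anchor": lambda text, tags: "a" in tags,
}


def featurize(html, features):
    """Return a list of (token, {feature_name: bool}) pairs for an HTML string."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return [
        (text, {name: test(text, tags) for name, test in features.items()})
        for text, tags in collector.tokens
    ]


page = (
    "<html><head><title>CS 101</title></head>"
    "<body><h1>Intro</h1><p>Taught by <a href='x'>Smith</a></p></body></html>"
)
rows = featurize(page, {**BASIC_FEATURES, **HTML_FEATURES})
```

In this sketch, extending the learner to Web pages amounts to merging `HTML_FEATURES` into the feature dictionary; a learner that consumes the resulting per-token feature vectors would not need to know anything about HTML itself.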
Pages: 517 - 523
Number of pages: 7
Related papers
50 in total
  • [21] Flexible Approach for Web Information Extraction Based on HTMLParser
    Shan, Lin
    Qun, Zhang
    [J]. PROCEEDINGS OF 2012 7TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION, VOLS I-VI, 2012, : 683 - 686
  • [22] Learning from HTML - Lessons for DTD authors
    Wood, L
    [J]. SGML '96 CONFERENCE PROCEEDINGS - CELEBRATING A DECADE OF SGML, 1996, : 231 - 233
  • [23] A HTML to WML Translating Model Based on Information Extraction for Mobile Commerce
    Song, Mingqiu
    Yu, Bo
    [J]. 2008 4TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING, VOLS 1-31, 2008, : 9166 - 9169
  • [24] The BigGrams: the semi-supervised information extraction system from HTML: an improvement in the wrapper induction
    Mironczuk, Marcin Michal
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2018, 54 (03) : 711 - 776
  • [25] Categorizing and extracting information from multilingual HTML documents
    Lim, SJ
    Ng, YK
    [J]. 9TH INTERNATIONAL DATABASE ENGINEERING & APPLICATION SYMPOSIUM, PROCEEDINGS, 2005, : 415 - 422
  • [26] Data-rich section extraction from HTML pages
    Wang, JY
    Lochovsky, FH
    [J]. WISE 2002: PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCE ON WEB INFORMATION SYSTEMS ENGINEERING, 2002, : 313 - 322
  • [27] Automating the extraction of data from HTML tables with unknown structure
    Embley, DW
    Tao, C
    Liddle, SW
    [J]. DATA & KNOWLEDGE ENGINEERING, 2005, 54 (01) : 3 - 28
  • [28] Detecting Research from an Uncurated HTML Archive Using Semi-Supervised Machine Learning
    McNulty, John
    Alvarez, Sarai
    Langmayr, Michael
    [J]. 2021 SYSTEMS AND INFORMATION ENGINEERING DESIGN SYMPOSIUM (IEEE SIEDS 2021), 2021, : 249 - 254
  • [29] Logic wrappers and XSLT transformations for tuples extraction from HTML
    Badica, C
    Badica, A
    [J]. DATABASE AND XML TECHNOLOGIES, PROCEEDINGS, 2005, 3671 : 177 - 191
  • [30] A clustering approach to extract data from HTML tables
    Jimenez, Patricia
    Roldan, Juan C.
    Corchuelo, Rafael
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2021, 58 (06)