Information extraction from HTML: Application of a general machine learning approach

Cited by: 0
Authors
Freitag, D [1]
Affiliations
[1] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Because the World Wide Web consists primarily of text, information extraction is central to any effort that would use the Web as a resource for knowledge discovery. We show how information extraction can be cast as a standard machine learning problem, and argue for the suitability of relational learning in solving it. The implementation of a general-purpose relational learner for information extraction, SRV, is described. In contrast with earlier learning systems for information extraction, SRV makes no assumptions about document structure and the kinds of information available for use in learning extraction patterns. Instead, structural and other information is supplied as input in the form of an extensible token-oriented feature set. We demonstrate the effectiveness of this approach by adapting SRV for use in learning extraction rules for a domain consisting of university course and research project pages sampled from the Web. Making SRV Web-ready only involves adding several simple HTML-specific features to its basic feature set.
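The idea of a token-oriented feature set extended with a few HTML-specific features can be illustrated with a minimal sketch. This is not SRV's implementation: the feature names (`capitalized`, `numeric`, `in_h1`, and so on), the tracked tags, and the helper `token_features` are all hypothetical, chosen only to show how per-token structural features might be supplied to a learner as input.

```python
from html.parser import HTMLParser

# Hypothetical set of tags to turn into "in_<tag>" features.
# The abstract does not list SRV's actual HTML features.
TRACKED_TAGS = ("title", "h1", "b", "a")


class TokenFeatureParser(HTMLParser):
    """Splits page text into whitespace tokens and attaches simple features."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # tracked tags that are currently open
        self.tokens = []     # list of (token, feature dict) pairs

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recently opened occurrence of this tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx]

    def handle_data(self, data):
        for word in data.split():
            # Generic, document-independent token features.
            features = {
                "capitalized": word[0].isupper(),
                "numeric": word.isdigit(),
            }
            # HTML-specific features added on top of the basic set.
            for tag in TRACKED_TAGS:
                features["in_" + tag] = tag in self.open_tags
            self.tokens.append((word, features))


def token_features(html):
    """Return (token, features) pairs for an HTML string."""
    parser = TokenFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Calling `token_features("<h1>CS 101</h1><p><b>Intro</b> to programming</p>")` yields one feature dictionary per token. A relational learner would then search for rules over such features, but that learning step is outside the scope of this sketch.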
Pages: 517-523
Page count: 7
Related papers
50 in total
  • [31] FLEXIBLE WEB INFORMATION EXTRACTION WITH HTMLPARSER
    Shan, Lin
    [J]. 3RD INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND COMPUTER SCIENCE (ITCS 2011), PROCEEDINGS, 2011, : 295 - 298
  • [32] Design and Implementation Learning Methods HTML5 on iCode Application
    Pebriadi, Pebi
    Wuryandari, Aciek Ida
    Setijadi, Ary P.
    [J]. PROCEEDINGS OF THE 2013 JOINT INTERNATIONAL CONFERENCE ON RURAL INFORMATION & COMMUNICATION TECHNOLOGY AND ELECTRIC-VEHICLE TECHNOLOGY (RICT & ICEV-T), 2013,
  • [33] Automatic HTML Code Generation from Mock-up Images Using Machine Learning Techniques
    Asiroglu, Batuhan
    Mate, Busra Rumeysa
    Yildiz, Eyyup
    Nalcakan, Yagiz
    Sezen, Alper
    Dagtekin, Mustafa
    Ensari, Tolga
    [J]. 2019 SCIENTIFIC MEETING ON ELECTRICAL-ELECTRONICS & BIOMEDICAL ENGINEERING AND COMPUTER SCIENCE (EBBT), 2019,
  • [34] HTML pattern generator - Automatic data extraction from web pages
    Cosulschi, Mirel
    Giurca, Adrian
    Udrescu, Bogdan
    Constantinescu, Nicolae
    Gabroveanu, Mihai
    [J]. SYNASC 2006: EIGHTH INTERNATIONAL SYMPOSIUM ON SYMBOLIC AND NUMERIC ALGORITHMS FOR SCIENTIFIC COMPUTING, PROCEEDINGS, 2007, : 75 - +
  • [35] An Analytical Approach to Concept Extraction in HTML Environments
    Victor Fresno
    Angela Ribeiro
    [J]. Journal of Intelligent Information Systems, 2004, 22 : 215 - 235
  • [36] Detecting similar HTML documents using a fuzzy set information retrieval approach
    Yerra, R
    Ng, YK
    [J]. 2005 IEEE INTERNATIONAL CONFERENCE ON GRANULAR COMPUTING, VOLS 1 AND 2, 2005, : 693 - 699
  • [37] An XML approach to semantically extract data from HTML tables
    Liu, JX
    Ao, ZY
    Park, HH
    Chen, YF
    [J]. DATABASE AND EXPERT SYSTEMS APPLICATIONS, PROCEEDINGS, 2005, 3588 : 696 - 705
  • [38] An automated approach for retrieving hierarchical data from HTML tables
    Lim, SJ
    Ng, YK
    [J]. PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON INFORMATION KNOWLEDGE MANAGEMENT, CIKM'99, 1999, : 466 - 474
  • [39] A hybrid quantum approach to leveraging data from HTML tables
    Jimenez, Patricia
    Roldan, Juan C.
    Corchuelo, Rafael
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2022, 64 (02) : 441 - 474
  • [40] Migrating Web Archives from HTML4 to HTML5: A Block-Based Approach and Its Evaluation
    Sanoja, Andres
    Gancarski, Stephane
    [J]. ADVANCES IN DATABASES AND INFORMATION SYSTEMS, ADBIS 2017, 2017, 10509 : 375 - 393