Information extraction from HTML: Application of a general machine learning approach

Cited by: 0

Authors:

Freitag, D [1]

Affiliations:

[1] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA

Source:

FIFTEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-98) AND TENTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE (IAAI-98) - PROCEEDINGS | 1998

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Because the World Wide Web consists primarily of text, information extraction is central to any effort that would use the Web as a resource for knowledge discovery. We show how information extraction can be cast as a standard machine learning problem, and argue for the suitability of relational learning in solving it. The implementation of a general-purpose relational learner for information extraction, SRV, is described. In contrast with earlier learning systems for information extraction, SRV makes no assumptions about document structure and the kinds of information available for use in learning extraction patterns. Instead, structural and other information is supplied as input in the form of an extensible token-oriented feature set. We demonstrate the effectiveness of this approach by adapting SRV for use in learning extraction rules for a domain consisting of university course and research project pages sampled from the Web. Making SRV Web-ready only involves adding several simple HTML-specific features to its basic feature set.
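The idea of a token-oriented feature set with a few HTML-specific additions can be illustrated with a minimal Python sketch. This is not SRV's implementation or its actual feature set: the class `TokenFeatureExtractor`, the function `token_features`, and the feature names (`capitalized`, `numeric`, `in_title`, `in_a`) are hypothetical, chosen only to show how generic and HTML-aware features might sit side by side as input to a learner.

```python
from html.parser import HTMLParser


class TokenFeatureExtractor(HTMLParser):
    """Hypothetical sketch: map each text token to a small feature dict."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # tags currently enclosing the text being read
        self.tokens = []     # one feature dict per whitespace-separated token

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recent matching open tag; tolerate sloppy HTML.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        for word in data.split():
            self.tokens.append({
                "text": word,
                # Generic features, usable on any plain-text document.
                "capitalized": word[0].isupper(),
                "numeric": word.isdigit(),
                # HTML-specific features (assumed examples).
                "in_title": "title" in self.open_tags,
                "in_a": "a" in self.open_tags,
            })


def token_features(html):
    """Return the list of per-token feature dicts for an HTML string."""
    parser = TokenFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Under these assumptions, extending the feature set means adding one more key to the dictionary built in `handle_data`; the rest of the pipeline is unchanged, which mirrors the extensibility the abstract describes.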

Pages: 517 - 523

Page count: 7
