Information extraction from HTML: Application of a general machine learning approach

Cited by: 0

Authors:

Freitag, D [1]

Affiliations:

[1] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA

Source:

FIFTEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-98) AND TENTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE (IAAI-98) - PROCEEDINGS | 1998

Keywords:

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Because the World Wide Web consists primarily of text, information extraction is central to any effort that would use the Web as a resource for knowledge discovery. We show how information extraction can be cast as a standard machine learning problem, and argue for the suitability of relational learning in solving it. The implementation of a general-purpose relational learner for information extraction, SRV, is described. In contrast with earlier learning systems for information extraction, SRV makes no assumptions about document structure and the kinds of information available for use in learning extraction patterns. Instead, structural and other information is supplied as input in the form of an extensible token-oriented feature set. We demonstrate the effectiveness of this approach by adapting SRV for use in learning extraction rules for a domain consisting of university course and research project pages sampled from the Web. Making SRV Web-ready only involves adding several simple HTML-specific features to its basic feature set.
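
The idea of an extensible token-oriented feature set can be illustrated with a small sketch. This is not SRV itself and does not reproduce its learner or its actual feature names; the names Token, BASIC_FEATURES, HTML_FEATURES, tokenize and featurize, as well as the individual features, are hypothetical and chosen only for illustration. The sketch assumes each feature is a boolean test on a single token, so that adding HTML awareness amounts to adding a few more tests without changing anything else.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
import re

# Tags that never have a closing tag, so they must not be tracked as "open".
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


@dataclass(frozen=True)
class Token:
    text: str
    enclosing_tags: tuple  # names of the HTML tags open around this token


class _Tokenizer(HTMLParser):
    """Splits text into tokens and records the tags enclosing each one."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching open tag and anything nested in it.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        for word in re.findall(r"\w+|[^\w\s]", data):
            self.tokens.append(Token(word, tuple(self.open_tags)))


def tokenize(html):
    parser = _Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Hypothetical general-purpose features: each maps a token to True/False.
BASIC_FEATURES = {
    "capitalized": lambda t: t.text[:1].isupper(),
    "numeric": lambda t: t.text.isdigit(),
    "long_word": lambda t: len(t.text) > 5,
}

# Hypothetical HTML-specific features, added without touching the rest.
HTML_FEATURES = {
    "in_h1": lambda t: "h1" in t.enclosing_tags,
    "in_a": lambda t: "a" in t.enclosing_tags,
    "in_b": lambda t: "b" in t.enclosing_tags,
}


def featurize(tokens, features):
    """Return one {feature_name: bool} row per token."""
    return [{name: test(tok) for name, test in features.items()} for tok in tokens]
```

A learner that consumes such rows needs no knowledge of HTML: passing {**BASIC_FEATURES, **HTML_FEATURES} instead of BASIC_FEATURES to featurize is the only change required to expose markup structure to it.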

Pages: 517-523

Page count: 7

[1] A General Learning Method for Automatic Title Extraction from HTML Pages
Changuel, Sahar
Labroche, Nicolas
Bouchon-Meunier, Bernadette
[J]. MACHINE LEARNING AND DATA MINING IN PATTERN RECOGNITION, 2009, 5632 : 704 - 718
[2] Rule learning for feature values extraction from HTML product information sheets
Badica, C
Badica, A
[J]. RULES AND RULE MARKUP LANGUAGES FOR THE SEMANTIC WEB, PROCEEDINGS, 2004, 3323 : 37 - 48
[3] Multimedia information extraction from HTML product catalogues
Labsky, Martin
Praks, Pavel
Svatek, Vojtech
Svab, Ondrej
[J]. DATESO 2005 - DATABASES, TEXTS, SPECIFICATIONS, OBJECTS, 2005, : 84 - 93
[4] Layout based information extraction from HTML documents
Burget, Radek
[J]. ICDAR 2007: NINTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2007, : 624 - 628
[5] Information extraction from HTML pages and its integration
Itai, K
Takasu, A
Adachi, J
[J]. 2003 SYMPOSIUM ON APPLICATIONS AND THE INTERNET WORKSHOPS, PROCEEDINGS, 2003, : 276 - 281
[6] Automatic machine learning of keyphrase extraction from short HTML documents written in Hebrew
Hacohen-Kerner, Yaakov
Stern, Ittay
Korkus, David
Fredj, Erick
[J]. CYBERNETICS AND SYSTEMS, 2007, 38 (01) : 1 - 21
[7] Extraction and integration information in HTML tables
Li, SJ
Peng, ZY
Liu, MC
[J]. FOURTH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY, PROCEEDINGS, 2004, : 315 - 320
[8] Information extraction from HTML tables base on domain ontology
Hsiao, SL
Chou, SC
Chang, LP
[J]. IKE'03: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE ENGINEERING, VOLS 1 AND 2, 2003, : 70 - 76
[9] An Event Data Extraction Method Based on HTML Structure Analysis and Machine Learning
Liao, Chenyi
Hiroi, Kei
Kaji, Katsuhiko
Kawaguchi, Nobuo
[J]. IEEE 39TH ANNUAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE WORKSHOPS (COMPSAC 2015), VOL 3, 2015, : 217 - 222
[10] Application of logic wrappers to hierarchical data extraction from HTML
Badica, Amelia
Badica, Costin
Popescu, Elvira
[J]. PROGRESS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2007, 4874 : 43 - +