A General Learning Method for Automatic Title Extraction from HTML Pages

Cited by: 0

Authors:

Changuel, Sahar [1]

Labroche, Nicolas [1]

Bouchon-Meunier, Bernadette [1]

Affiliations:

[1] LIP6, DAPA, F-75016 Paris, France

Source:

MACHINE LEARNING AND DATA MINING IN PATTERN RECOGNITION | 2009 / Vol. 5632

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper addresses the problem of automatically learning the title metadata from HTML documents. The objective is to help index Web resources that are poorly annotated. Other works have proposed similar objectives, but they considered only titles in text format. In this paper we propose a general learning schema that allows learning textual titles based on style information and image-format titles based on image properties. We construct features from automatically annotated pages harvested from the Web; this paper details the corpus creation method as well as the information extraction techniques. Based on these features, learning algorithms such as Decision Trees and Random Forests are applied, achieving good results despite the heterogeneity of our corpus. We also show that combining both methods can yield better performance.

Pages: 704-718

Page count: 15
