Title identification of web article pages using HTML and visual features

Cited by: 2

Authors:

Fan, Jian [1]

Luo, Ping [2]

Joshi, Parag [1]

Affiliations:

[1] Hewlett Packard Labs, 1501 Page Mill Rd, Palo Alto, CA 94304 USA

[2] Hewlett Packard Labs, Beijing 100084, People's Republic of China

Source:

IMAGING AND PRINTING IN A WEB 2.0 WORLD II | 2011 / Vol. 7879

Keywords:

data extraction; web article extraction; title identification

DOI:

10.1117/12.876708

Chinese Library Classification (CLC) number:

O43 [Optics]

Subject classification codes:

070207; 0803

Abstract:

Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
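
The abstract names four features (the HTML title field, the HTML tag of a DOM node, font size, and horizontal alignment) but does not say how they are combined. The following is only a rough sketch of how such features might be merged into a single score. The `Candidate` fields, the weights, the tag bonuses, and the use of `difflib.SequenceMatcher` for text similarity are all assumptions made for illustration and are not taken from the paper. A font-size-only picker is included for comparison with the baseline the abstract mentions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """A rendered DOM text node (hypothetical structure, not from the paper)."""
    text: str
    tag: str          # HTML tag of the DOM node, e.g. "h1"
    font_size: float  # rendered font size in pixels
    left: float       # rendered left edge in pixels
    width: float      # rendered width in pixels


# Assumed tag bonuses: heading tags are treated as more title-like.
TAG_BONUS = {"h1": 1.0, "h2": 0.7, "h3": 0.4}


def title_similarity(text: str, html_title: str) -> float:
    """Similarity in [0, 1] between node text and the page's <title> field."""
    return SequenceMatcher(None, text.lower(), html_title.lower()).ratio()


def alignment_score(c: Candidate, page_width: float) -> float:
    """1.0 when the node is horizontally centered, falling to 0.0 at the edges."""
    center = c.left + c.width / 2
    half = page_width / 2
    return 1.0 - min(1.0, abs(center - half) / half)


def score(c: Candidate, html_title: str, max_font: float, page_width: float) -> float:
    """Weighted sum of the four features; the weights are illustrative guesses."""
    return (0.4 * title_similarity(c.text, html_title)
            + 0.2 * TAG_BONUS.get(c.tag, 0.0)
            + 0.3 * c.font_size / max_font
            + 0.1 * alignment_score(c, page_width))


def pick_title(cands: list[Candidate], html_title: str, page_width: float) -> Candidate:
    """Return the candidate with the highest combined score."""
    max_font = max(c.font_size for c in cands)
    return max(cands, key=lambda c: score(c, html_title, max_font, page_width))


def pick_by_font_size(cands: list[Candidate]) -> Candidate:
    """Font-size-only baseline: the largest text wins."""
    return max(cands, key=lambda c: c.font_size)


# Made-up page: a large site banner competes with the actual headline.
page = [
    Candidate("Daily News", "div", 40.0, 0.0, 300.0),
    Candidate("Storm hits coast", "h1", 28.0, 200.0, 600.0),
    Candidate("By A. Reporter", "p", 12.0, 200.0, 200.0),
]
html_title = "Storm hits coast - Daily News"

print(pick_by_font_size(page).text)
print(pick_title(page, html_title, page_width=1000.0).text)
```

On this made-up page the font-size baseline selects the banner, while the combined score favors the headline because it matches the title field more closely, uses a heading tag, and is centered. Real pages would need weights tuned on labeled data.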

Pages: 5
