Title identification of web article pages using HTML and visual features

Cited by: 2
Authors
Fan, Jian [1]
Luo, Ping [2]
Joshi, Parag [1]
Affiliations
[1] Hewlett Packard Labs, 1501 Page Mill Rd, Palo Alto, CA 94304 USA
[2] Hewlett Packard Labs, Beijing 100084, People's Republic of China
Keywords
data extraction; web article extraction; title identification
DOI
10.1117/12.876708
Chinese Library Classification number
O43 [Optics]
Discipline classification codes
070207; 0803
Abstract
Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground-truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
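To make the idea of combining these features concrete, the following is a minimal sketch, not the authors' algorithm. It scores candidate DOM nodes using the four features named in the abstract (agreement with the HTML title field, HTML tag, font size, horizontal alignment) and compares the result with a font-size-only baseline. The `Candidate` structure, the weights, the tag bonuses, and the use of `difflib.SequenceMatcher` for text similarity are all assumptions made for illustration; the abstract does not specify how the features are combined.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """A rendered DOM text node (hypothetical structure, not from the paper)."""
    text: str
    tag: str          # HTML tag name, e.g. "h1" or "div"
    font_size: float  # rendered font size in pixels
    left: float       # left edge of the node's box in pixels
    width: float      # width of the node's box in pixels


# Illustrative tag bonuses only; these values are assumptions.
TAG_BONUS = {"h1": 1.0, "h2": 0.6, "h3": 0.3}


def title_field_similarity(text: str, title_field: str) -> float:
    """Similarity in [0, 1] between node text and the HTML <title> field."""
    return SequenceMatcher(None, text.lower(), title_field.lower()).ratio()


def centering(c: Candidate, page_width: float) -> float:
    """1.0 for a horizontally centered node, decreasing toward 0.0 at the edges."""
    node_center = c.left + c.width / 2
    offset = abs(node_center - page_width / 2)
    return max(0.0, 1.0 - offset / (page_width / 2))


def score(c: Candidate, title_field: str, max_font: float, page_width: float) -> float:
    """Weighted sum of the four features; the weights are assumed, not published."""
    return (
        2.0 * title_field_similarity(c.text, title_field)
        + 1.0 * TAG_BONUS.get(c.tag, 0.0)
        + 1.0 * c.font_size / max_font
        + 0.5 * centering(c, page_width)
    )


def pick_title(candidates: list[Candidate], title_field: str, page_width: float) -> Candidate:
    """Return the candidate with the highest combined score."""
    max_font = max(c.font_size for c in candidates)
    return max(candidates, key=lambda c: score(c, title_field, max_font, page_width))


def pick_title_by_font_size(candidates: list[Candidate]) -> Candidate:
    """Baseline: return the candidate with the largest font."""
    return max(candidates, key=lambda c: c.font_size)
```

In this sketch, a large site banner can win under the font-size baseline while losing under the combined score, because the article headline usually matches the HTML title field more closely and tends to carry a heading tag.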
Pages: 5