Title identification of web article pages using HTML and visual features

Times cited: 2
Authors
Fan, Jian [1]
Luo, Ping [2]
Joshi, Parag [1]
Affiliations
[1] Hewlett Packard Labs, 1501 Page Mill Rd, Palo Alto, CA 94304 USA
[2] Hewlett Packard Labs, Beijing 100084, People's Republic of China
Source
IMAGING AND PRINTING IN A WEB 2.0 WORLD II | 2011 / Vol. 7879
Keywords
data extraction; web article extraction; title identification
DOI
10.1117/12.876708
Chinese Library Classification (CLC) number
O43 [Optics]
Discipline classification code
070207; 0803
Abstract
Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
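The abstract names four features but does not say how they are combined. The following is only a minimal sketch of one way such features could be scored together, not the authors' method: the Candidate fields, the tag scores, the weights, and the weighted-sum rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical tag scores and feature weights; the abstract gives no values.
TAG_SCORE = {"h1": 1.0, "h2": 0.7, "h3": 0.5}
W_TITLE, W_TAG, W_FONT, W_ALIGN = 0.4, 0.2, 0.3, 0.1


@dataclass
class Candidate:
    text: str         # visible text of the DOM node
    tag: str          # HTML tag name of the node
    font_size: float  # rendered font size in pixels
    left: float       # left edge of the node in pixels
    width: float      # rendered width of the node in pixels


def title_similarity(text: str, page_title: str) -> float:
    """Similarity between the node text and the HTML <title> field, in [0, 1]."""
    return SequenceMatcher(None, text.lower(), page_title.lower()).ratio()


def alignment_score(c: Candidate, body_left: float, body_width: float) -> float:
    """1.0 when the node is left-aligned with, or centered in, the article column."""
    left_offset = abs(c.left - body_left)
    center_offset = abs((c.left + c.width / 2) - (body_left + body_width / 2))
    return max(0.0, 1.0 - min(left_offset, center_offset) / body_width)


def score(c: Candidate, page_title: str, max_font: float,
          body_left: float, body_width: float) -> float:
    """Weighted sum of the four features (an assumed combination rule)."""
    return (W_TITLE * title_similarity(c.text, page_title)
            + W_TAG * TAG_SCORE.get(c.tag.lower(), 0.0)
            + W_FONT * c.font_size / max_font
            + W_ALIGN * alignment_score(c, body_left, body_width))


def identify_title(candidates, page_title, body_left, body_width):
    """Return the highest-scoring candidate node."""
    max_font = max(c.font_size for c in candidates)
    return max(candidates,
               key=lambda c: score(c, page_title, max_font, body_left, body_width))


# Made-up page: a large site logo, the article headline, and a sidebar heading.
candidates = [
    Candidate("Example News", "div", 40, 0, 200),
    Candidate("Title identification of web pages", "h1", 28, 100, 600),
    Candidate("Related stories", "h3", 18, 100, 600),
]
page_title = "Title identification of web pages | Example News"

best = identify_title(candidates, page_title, body_left=100, body_width=600)
baseline = max(candidates, key=lambda c: c.font_size)  # font size only
print(best.text)
print(baseline.text)
```

In this made-up page the font-size-only baseline selects the site logo, while the combined score selects the headline, which illustrates why additional features such as the HTML title field can help.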
Pages: 5