Title identification of web article pages using HTML and visual features

Cited by: 2
Authors
Fan, Jian [1]
Luo, Ping [2]
Joshi, Parag [1]
Affiliations
[1] Hewlett Packard Labs, 1501 Page Mill Rd, Palo Alto, CA 94304 USA
[2] Hewlett Packard Labs, Beijing 100084, Peoples R China
Keywords
data extraction; web article extraction; title identification
DOI
10.1117/12.876708
Chinese Library Classification number
O43 [Optics]
Discipline classification codes
070207; 0803
Abstract
Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground-truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
Pages: 5
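The abstract names four features (the HTML title field, the HTML tag of a DOM node, font size, and horizontal alignment) but does not say how they are combined. The following is only a minimal sketch of one way such features could be scored together, not the authors' method. The `Candidate` fields, the `TAG_SCORE` table, the weights, and the 20-pixel alignment tolerance are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical tag preferences; the abstract does not give actual values.
TAG_SCORE = {"h1": 1.0, "h2": 0.7, "h3": 0.4}


@dataclass
class Candidate:
    text: str         # visible text of the DOM node
    tag: str          # lowercase HTML tag of the DOM node
    font_size: float  # rendered font size in pixels
    left: float       # x offset of the node's left edge in pixels


def title_similarity(text: str, page_title: str) -> float:
    """Similarity in [0, 1] between node text and the page's title field."""
    return SequenceMatcher(None, text.lower(), page_title.lower()).ratio()


def score(c: Candidate, page_title: str, max_font: float, body_left: float) -> float:
    """Weighted sum of the four features; the weights are assumptions."""
    sim = title_similarity(c.text, page_title)
    tag = TAG_SCORE.get(c.tag, 0.0)
    font = c.font_size / max_font if max_font else 0.0
    # Treat a node as aligned if its left edge is within 20 px of the article body.
    aligned = 1.0 if abs(c.left - body_left) <= 20 else 0.0
    return 0.4 * sim + 0.2 * tag + 0.3 * font + 0.1 * aligned


def identify_title(candidates, page_title, body_left):
    """Return the highest-scoring candidate node."""
    max_font = max(c.font_size for c in candidates)
    return max(candidates, key=lambda c: score(c, page_title, max_font, body_left))


def font_size_baseline(candidates):
    """Baseline that picks the node with the largest font."""
    return max(candidates, key=lambda c: c.font_size)


# Made-up page: a large site logo and a smaller article headline.
candidates = [
    Candidate("Daily News", "div", 40.0, 0.0),
    Candidate("Storm hits coast", "h1", 28.0, 100.0),
]
page_title = "Storm hits coast - Daily News"

best = identify_title(candidates, page_title, body_left=100.0)
baseline = font_size_baseline(candidates)
print(best.text)
print(baseline.text)
```

On this made-up page the font-size-only baseline selects the site logo, while the combined score selects the headline, which illustrates why additional features can help. The example says nothing about the accuracy figures reported in the abstract.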