Title identification of web article pages using HTML and visual features

Cited by: 2
Authors
Fan, Jian [1]
Luo, Ping [2]
Joshi, Parag [1]
Affiliations
[1] Hewlett Packard Labs, 1501 Page Mill Rd, Palo Alto, CA 94304 USA
[2] Hewlett Packard Labs, Beijing 100084, People's Republic of China
Source
IMAGING AND PRINTING IN A WEB 2.0 WORLD II | 2011 / Vol. 7879
Keywords
data extraction; web article extraction; title identification;
DOI
10.1117/12.876708
Chinese Library Classification number
O43 [Optics];
Discipline classification code
070207 ; 0803 ;
Abstract
Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page and the HTML tag of a DOM node, as well as font size and horizontal alignment. We tested our method on a ground truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
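The abstract names four features (the HTML title field, the HTML tag of a DOM node, font size, and horizontal alignment) but does not say how they are combined. The sketch below is one possible illustration, not the authors' method: it scores each candidate node with a weighted sum of the four features and compares the result with a font-size-only baseline. The `Candidate` type, the weights, and the tag and alignment score tables are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    """A text-bearing DOM node with its rendered properties (hypothetical type)."""
    text: str
    tag: str          # HTML tag name, e.g. "h1"
    font_size: float  # rendered font size in pixels
    alignment: str    # "left", "center", or "right"


# Assumed per-feature scores; the abstract does not give any values.
TAG_SCORE = {"h1": 1.0, "h2": 0.7, "h3": 0.5}
ALIGN_SCORE = {"center": 1.0, "left": 0.8, "right": 0.2}


def title_similarity(text: str, html_title: str) -> float:
    """Similarity in [0, 1] between node text and the page's <title> field."""
    return SequenceMatcher(None, text.lower(), html_title.lower()).ratio()


def score(c: Candidate, html_title: str, max_font: float) -> float:
    """Weighted sum of the four features (weights are assumptions)."""
    return (0.4 * title_similarity(c.text, html_title)
            + 0.2 * TAG_SCORE.get(c.tag, 0.0)
            + 0.3 * c.font_size / max_font
            + 0.1 * ALIGN_SCORE.get(c.alignment, 0.0))


def identify_title(candidates: list[Candidate], html_title: str) -> Candidate:
    """Return the candidate with the highest combined score."""
    max_font = max(c.font_size for c in candidates)
    return max(candidates, key=lambda c: score(c, html_title, max_font))


def font_size_baseline(candidates: list[Candidate]) -> Candidate:
    """Baseline that looks at font size only."""
    return max(candidates, key=lambda c: c.font_size)


if __name__ == "__main__":
    page_title = "Storm hits coast | Daily News"
    nodes = [
        Candidate("Daily News", "div", 36.0, "center"),    # site banner
        Candidate("Storm hits coast", "h1", 28.0, "left"),  # article headline
        Candidate("Related stories", "h3", 16.0, "left"),
    ]
    print(font_size_baseline(nodes).text)
    print(identify_title(nodes, page_title).text)
```

In this made-up page, the site banner has the largest font, so the font-only baseline selects it, while the combined score favors the headline because it matches the title field and sits in an `h1` node.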
Pages: 5