Title identification of web article pages using HTML and visual features

Cited by: 2
Authors
Fan, Jian [1]
Luo, Ping [2]
Joshi, Parag [1]
Affiliations
[1] Hewlett Packard Labs, 1501 Page Mill Rd, Palo Alto, CA 94304 USA
[2] Hewlett Packard Labs, Beijing 100084, People's Republic of China
Keywords
data extraction; web article extraction; title identification
DOI
10.1117/12.876708
Chinese Library Classification number
O43 [Optics]
Discipline classification code
070207; 0803
Abstract
Extracting informative content from Web article pages has many applications, such as printing and content reuse. The title is a significant and unique component of an article. However, identifying the true title is not an easy problem, even for human readers. In this paper, we present a title identification method that takes into account several features, including the title field of the HTML page, the HTML tag of a DOM node, font size, and horizontal alignment. We tested our method on a ground-truth data set consisting of 1993 pages from 98 web sites and achieved 97.5% accuracy, about 20% above a baseline method based only on font size.
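As a rough illustration of how the four features named in the abstract could be combined, the sketch below scores candidate DOM nodes and picks the highest-scoring one. It is not the authors' algorithm: the `Candidate` fields, the `TAG_WEIGHT` table, the feature weights, and the reading of "horizontal alignment" as closeness to the page centre are all assumptions made for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical tag weights; the abstract does not state how tags are scored.
TAG_WEIGHT = {"h1": 1.0, "h2": 0.7, "h3": 0.4}


@dataclass
class Candidate:
    text: str         # visible text of the DOM node
    tag: str          # HTML tag name, e.g. "h1"
    font_size: float  # rendered font size in pixels
    center_x: float   # horizontal centre of the node in pixels


def score(c: Candidate, page_title: str, max_font: float, page_width: float) -> float:
    """Combine the four features into one score (weights are assumed)."""
    # Similarity between the node text and the HTML <title> field.
    title_sim = SequenceMatcher(None, c.text.lower(), page_title.lower()).ratio()
    tag_score = TAG_WEIGHT.get(c.tag.lower(), 0.0)
    font_score = c.font_size / max_font if max_font > 0 else 0.0
    # 1.0 when the node is centred on the page, 0.0 at or beyond the edge.
    half = page_width / 2
    align_score = max(0.0, 1.0 - abs(c.center_x - half) / half)
    return 0.4 * title_sim + 0.2 * tag_score + 0.3 * font_score + 0.1 * align_score


def pick_title(candidates: list[Candidate], page_title: str, page_width: float) -> Candidate:
    """Return the candidate with the highest combined score."""
    max_font = max(c.font_size for c in candidates)
    return max(candidates, key=lambda c: score(c, page_title, max_font, page_width))


def font_size_baseline(candidates: list[Candidate]) -> Candidate:
    """Baseline in the spirit of the abstract: choose the largest font only."""
    return max(candidates, key=lambda c: c.font_size)
```

On a page where a promotional banner uses the largest font, a font-only baseline of this kind would select the banner, while the combined score can still prefer the heading whose text overlaps the HTML title field.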
Pages: 5
Related papers
50 in total
  • [41] A stacking model using URL and HTML features for phishing webpage detection
    Li, Yukun
    Yang, Zhenguo
    Chen, Xu
    Yuan, Huaping
    Liu, Wenyin
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2019, 94 : 27 - 39
  • [42] An effective detection approach for phishing websites using URL and HTML features
    Aljofey, Ali
    Jiang, Qingshan
    Rasool, Abdur
    Chen, Hui
    Liu, Wenyin
    Qu, Qiang
    Wang, Yang
    [J]. SCIENTIFIC REPORTS, 2022, 12 (01)
  • [43] Wikxhibit: Using HTML and Wikidata to Author Applications that Link Data Across the Web
    Alrashed, Tarfah
    Verou, Lea
    Karger, David R.
    [J]. PROCEEDINGS OF THE 35TH ANNUAL ACM SYMPOSIUM ON USER INTERFACE SOFTWARE AND TECHNOLOGY, UIST 2022, 2022
  • [44] Deploying Web-based Control Laboratory Using HTML5
    Lei, Zhongcheng
    Hu, Wenshan
    Zhou, Hong
    [J]. PROCEEDINGS OF 2016 13TH INTERNATIONAL CONFERENCE ON REMOTE ENGINEERING AND VIRTUAL INSTRUMENTATION (REV), 2016 : 69 - 73
  • [45] Content-rich web document segmentation based on HTML tag structures and visual cues
    Li, Longzhuang
    Liu, Yonghuai
    Fernandez, John
    [J]. 3RD INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATIONS AND CONTROL TECHNOLOGIES, VOL 3, PROCEEDINGS, 2005 : 159 - 164
  • [46] Deployment of a Web-based Control Laboratory Using HTML5
    Lei, Zhongcheng
    Hu, Wenshan
    Zhou, Hong
    [J]. INTERNATIONAL JOURNAL OF ONLINE ENGINEERING, 2016, 12 (07) : 18 - 23
  • [47] A method of readability assessment for web documents using text features and HTML structures
    Yamasaki, Takahiro
    Tokiwa, Kin-Ichiroh
    [J]. IEEJ Transactions on Electronics, Information and Systems, 2012, 132 (09) : 1524 - 1532
  • [48] SurveyWiz and factorWiz: Javascript Web pages that make HTML forms for research on the internet
    Birnbaum, Michael H.
    [J]. Behavior Research Methods, Instruments, & Computers, 2000, 32 : 339 - 346
  • [49] HTML5 Powered Web Application for Telecardiology: A Case Study using ECGs
    Kumar, M. Arun
    Srinivasan, Anand
    Bussa, Nagaraju
    [J]. 2013 IEEE POINT-OF-CARE HEALTHCARE TECHNOLOGIES (PHT), 2013 : 156 - 159
  • [50] Using HTML5 to prevent detection of drive-by-download web malware
    De Santis, Alfredo
    De Maio, Giancarlo
    Petrillo, Umberto Ferraro
    [J]. SECURITY AND COMMUNICATION NETWORKS, 2015, 8 (07) : 1237 - 1255