Accurate and efficient general-purpose boilerplate detection for crawled web corpora

Author
Roland Schäfer
Affiliation
[1] Freie Universität Berlin, Deutsche und niederländische Philologie
Keywords
Corpus construction; Web corpora; Boilerplate; Non-destructive corpus normalization
DOI
Not available
Abstract
Removal of boilerplate is one of the essential tasks in web corpus construction and web indexing. Boilerplate (redundant and automatically inserted material like menus, copyright notices, navigational elements, etc.) is usually considered to be linguistically unattractive for inclusion in a web corpus. Also, search engines should not index such material because it can lead to spurious results for search terms if these terms appear in boilerplate regions of the web page. In this paper, I present and evaluate a supervised machine-learning approach to general-purpose boilerplate detection for languages based on Latin alphabets using Multi-Layer Perceptrons (MLPs). It is both very efficient and very accurate (between 95% and 99% correct classifications, depending on the input language). I show that language-specific classifiers greatly improve the accuracy of boilerplate detectors. The single features used for the classification are evaluated with regard to the merit they contribute to the classification. Furthermore, I show that the accuracy of the MLP is on a par with that of a wide range of other classifiers. My approach has been implemented in the open-source texrex web page cleaning software, and large corpora constructed using it are available from the COW initiative, including the CommonCOW corpora created from CommonCrawl datasets.
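The abstract does not list the features or the network layout, so the following is only a minimal sketch of how an MLP might score one text block as boilerplate. Everything in it is an assumption made for illustration: the four features in `block_features`, the `link_chars` input, the two-unit hidden layer, and the hand-picked weights (which are not trained and are not the texrex values).

```python
import math
import re


def block_features(text, link_chars):
    """Hypothetical per-block features; not the feature set from the paper."""
    n = max(len(text), 1)
    letters = sum(ch.isalpha() for ch in text)
    sentence_ends = len(re.findall(r"[.!?](?:\s|$)", text))
    return [
        min(len(text) / 500.0, 1.0),    # block length, capped at 1
        letters / n,                    # proportion of alphabetic characters
        min(link_chars / n, 1.0),       # share of characters inside links
        min(sentence_ends / 5.0, 1.0),  # sentence-final punctuation, capped at 1
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Illustrative, hand-picked weights (assumption): unit 0 reacts to prose-like
# blocks, unit 1 to link-heavy blocks. A real system would learn these from
# annotated training data.
W_HIDDEN = [[4.0, 0.0, -6.0, 4.0], [-4.0, 0.0, 6.0, -4.0]]
B_HIDDEN = [-1.0, 0.0]
W_OUT = [-5.0, 5.0]
B_OUT = 0.0


def boilerplate_score(text, link_chars):
    """Forward pass of a one-hidden-layer MLP; returns a value in (0, 1)."""
    x = block_features(text, link_chars)
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_HIDDEN, B_HIDDEN)
    ]
    return sigmoid(sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT)


menu = "Home | About | Contact | Imprint"
prose = (
    "Removal of boilerplate is an essential task in web corpus construction. "
    "Search engines should not index such material either."
)
print(round(boilerplate_score(menu, link_chars=23), 3))
print(round(boilerplate_score(prose, link_chars=0), 3))
```

With these placeholder weights, a score above 0.5 would be read as boilerplate. In a supervised setting, the weights would instead be fitted to blocks labelled by human annotators, possibly with one classifier per language as the abstract suggests.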
Pages: 873-889
Page count: 16
Source: Language Resources and Evaluation, 2017, 51 (03): 873-889