Web Page Segmentation Revisited: Evaluation Framework and Dataset

Cited by: 9
Authors
Kiesel, Johannes [1]
Kneist, Florian [1]
Meyer, Lars [1]
Komlossy, Kristof [1]
Stein, Benno [1]
Potthast, Martin [2]
Affiliations
[1] Bauhaus Univ Weimar, Weimar, Germany
[2] Univ Leipzig, Leipzig, Germany
DOI
10.1145/3340531.3412782
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Each web page can be segmented into semantically coherent units that fulfill specific purposes. Though the task of automatic web page segmentation was introduced two decades ago, along with several applications in web content analysis, its foundations are still lacking. Specifically, the developed evaluation methods and datasets presume a certain downstream task, which led to a variety of incompatible datasets and evaluation methods. To address this shortcoming, we contribute two resources: (1) An evaluation framework which can be adjusted to downstream tasks by measuring the segmentation similarity regarding visual, structural, and textual elements, and which includes measures for annotator agreement, segmentation quality, and an algorithm for segmentation fusion. (2) The Webis-WebSeg-20 dataset, comprising 42,450 crowdsourced segmentations for 8,490 web pages, outranging existing sources by an order of magnitude. Our results help to better understand the "mental segmentation model" of human annotators: Among other things we find that annotators mostly agree on segmentations for all kinds of web page elements (visual, structural, and textual). Disagreement exists mostly regarding the right level of granularity, indicating a general agreement on the visual structure of web pages.
Pages: 3047-3054
Number of pages: 8