Web Page Segmentation Revisited: Evaluation Framework and Dataset

Cited by: 9
Authors
Kiesel, Johannes [1]
Kneist, Florian [1]
Meyer, Lars [1]
Komlossy, Kristof [1]
Stein, Benno [1]
Potthast, Martin [2]
Affiliations
[1] Bauhaus Univ Weimar, Weimar, Germany
[2] Univ Leipzig, Leipzig, Germany
DOI
10.1145/3340531.3412782
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Each web page can be segmented into semantically coherent units that fulfill specific purposes. Though the task of automatic web page segmentation was introduced two decades ago, along with several applications in web content analysis, its foundations are still lacking. Specifically, the developed evaluation methods and datasets presume a certain downstream task, which led to a variety of incompatible datasets and evaluation methods. To address this shortcoming, we contribute two resources: (1) An evaluation framework which can be adjusted to downstream tasks by measuring the segmentation similarity regarding visual, structural, and textual elements, and which includes measures for annotator agreement, segmentation quality, and an algorithm for segmentation fusion. (2) The Webis-WebSeg-20 dataset, comprising 42,450 crowdsourced segmentations for 8,490 web pages, outranging existing sources by an order of magnitude. Our results help to better understand the "mental segmentation model" of human annotators: Among other things we find that annotators mostly agree on segmentations for all kinds of web page elements (visual, structural, and textual). Disagreement exists mostly regarding the right level of granularity, indicating a general agreement on the visual structure of web pages.
Pages: 3047-3054
Number of pages: 8
Related papers
50 in total
  • [1] Web Page Segmentation Evaluation
    Sanoja, Andres
    Gancarski, Stephane
    30TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, VOLS I AND II, 2015: 753-760
  • [2] Block-o-Matic: A Web Page Segmentation Framework
    Sanoja, Andres
    Gancarski, Stephane
    2014 INTERNATIONAL CONFERENCE ON MULTIMEDIA COMPUTING AND SYSTEMS (ICMCS), 2014: 601-606
  • [3] Web Page Segmentation with Structured Prediction and its Application in Web Page Classification
    Bing, Lidong
    Guo, Rui
    Lam, Wai
    Niu, Zheng-Yu
    Wang, Haifeng
    SIGIR'14: PROCEEDINGS OF THE 37TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2014: 767-776
  • [4] WooIR: A New Open Page Stream Segmentation Dataset
    van Heusden, Ruben
    Kamps, Jaap
    Marx, Maarten
    PROCEEDINGS OF THE 2022 ACM SIGIR INTERNATIONAL CONFERENCE ON THE THEORY OF INFORMATION RETRIEVAL, ICTIR 2022, 2022: 165-174
  • [5] Web Page Segmentation and its Application for Web Information Crawling
    Feng, Hanyang
    Zhang, Wenzhe
    Wu, Hesheng
    Wang, Chong-Jun
    2016 IEEE 28TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2016), 2016: 598-605
  • [6] Web Page Segmentation Towards Information Extraction for Web Semantics
    Malhotra, Pooja
    Malik, Sanjay Kumar
    INTERNATIONAL CONFERENCE ON INNOVATIVE COMPUTING AND COMMUNICATIONS, VOL 2, 2019, 56: 431-442
  • [7] Web page dependent vision based segmentation for web sites
    Ko, Pyungkwan
    Kang, Sanggil
    Kumar, Harshit
    7TH IEEE/ACIS INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE IN CONJUNCTION WITH 2ND IEEE/ACIS INTERNATIONAL WORKSHOP ON E-ACTIVITY, PROCEEDINGS, 2008, : 690 - +
  • [8] A Novel Method for the Web page Segmentation And Identification
    Wang, Jing
    Liu, Zhijing
    2009 INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND TECHNOLOGY, VOL I, PROCEEDINGS, 2009: 229-231
  • [9] A Framework for Web Page Rank Prediction
    Voudigari, Elli
    Pavlopoulos, John
    Vazirgiannis, Michalis
    ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS, PT II, 2011, 364: 240-249
  • [10] Web page segmentation based on Gestalt theory
    Xiang, Peifeng
    Yang, Xin
    Shi, Yuanchun
    2007 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-5, 2007: 2253-2256