Web Page Segmentation Revisited: Evaluation Framework and Dataset

Cited by: 9
Authors
Kiesel, Johannes [1]
Kneist, Florian [1]
Meyer, Lars [1]
Komlossy, Kristof [1]
Stein, Benno [1]
Potthast, Martin [2]
Affiliations
[1] Bauhaus Univ Weimar, Weimar, Germany
[2] Univ Leipzig, Leipzig, Germany
DOI
10.1145/3340531.3412782
Chinese Library Classification number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
Each web page can be segmented into semantically coherent units that fulfill specific purposes. Though the task of automatic web page segmentation was introduced two decades ago, along with several applications in web content analysis, its foundations are still lacking. Specifically, the developed evaluation methods and datasets presume a certain downstream task, which has led to a variety of incompatible datasets and evaluation methods. To address this shortcoming, we contribute two resources: (1) An evaluation framework that can be adjusted to downstream tasks by measuring the segmentation similarity regarding visual, structural, and textual elements, and that includes measures for annotator agreement and segmentation quality as well as an algorithm for segmentation fusion. (2) The Webis-WebSeg-20 dataset, comprising 42,450 crowdsourced segmentations for 8,490 web pages, exceeding existing sources by an order of magnitude. Our results help to better understand the "mental segmentation model" of human annotators: Among other things, we find that annotators mostly agree on segmentations for all kinds of web page elements (visual, structural, and textual). Disagreement exists mostly regarding the right level of granularity, indicating a general agreement on the visual structure of web pages.
Pages: 3047-3054
Page count: 8
Related papers
50 in total
  • [41] Enhanced Gestalt Theory Guided Web Page Segmentation for Mobile Browsing
    Yang, Xin
    Shi, Yuanchun
    2009 IEEE/WIC/ACM INTERNATIONAL JOINT CONFERENCES ON WEB INTELLIGENCE (WI) AND INTELLIGENT AGENT TECHNOLOGIES (IAT), VOL 3, 2009, : 46 - 49
  • [42] Towards an Improved Vision-based Web Page Segmentation Algorithm
    Cormier, Michael
    Mann, Richard
    Moffatt, Karyn
    Cohen, Robin
    2017 14TH CONFERENCE ON COMPUTER AND ROBOT VISION (CRV 2017), 2017, : 345 - 352
  • [43] An Evaluation of DNN Architectures for Page Segmentation of Historical Newspapers
    Liebl, Bernhard
    Burghardt, Manuel
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 5153 - 5160
  • [44] A Novel Feature Selection Framework for Automatic Web Page Classification
    J. Alamelu Mangai
    V. Santhosh Kumar
    S. Appavu alias Balamurugan
    International Journal of Automation and Computing, 2012, (04) : 442 - 448
  • [45] A Novel Feature Selection Framework for Automatic Web Page Classification
    J. Alamelu Mangai
    V. Santhosh Kumar
    S. Appavu alias Balamurugan
    International Journal of Automation & Computing , 2012, (04) : 442 - 448
  • [46] A framework to derive web page context from hyperlink structure
    Chauhan, Naresh
    Sharma, A.K.
    International Journal of Information and Communication Technology, 2008, 1 (3-4) : 329 - 346
  • [47] A Novel Feature Selection Framework for Automatic Web Page Classification
    Mangai, J. Alamelu
    Kumar, V. Santhosh
    Balamurugan, S. Appavu Alias
    INTERNATIONAL JOURNAL OF AUTOMATION AND COMPUTING, 2012, 9 (04) : 442 - 448
  • [48] An automatic performance evaluation method for document page segmentation
    Peng, LR
    Chen, M
    Liu, CS
    Ding, XQ
    Zheng, JR
    SIXTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, PROCEEDINGS, 2001, : 134 - 137
  • [49] Software architecture of PSET: A page segmentation evaluation toolkit
    Mao S.
    Kanungo T.
    International Journal on Document Analysis and Recognition, 2002, 4 (3) : 205 - 217
  • [50] Word segmentation and recognition for Web document framework
    Chi, CH
    Ding, C
    Lim, A
    PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON INFORMATION KNOWLEDGE MANAGEMENT, CIKM'99, 1999, : 458 - 465