Web Page Segmentation Revisited: Evaluation Framework and Dataset

Cited by: 9
Authors
Kiesel, Johannes [1 ]
Kneist, Florian [1 ]
Meyer, Lars [1 ]
Komlossy, Kristof [1 ]
Stein, Benno [1 ]
Potthast, Martin [2 ]
Affiliations
[1] Bauhaus Univ Weimar, Weimar, Germany
[2] Univ Leipzig, Leipzig, Germany
DOI
10.1145/3340531.3412782
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Each web page can be segmented into semantically coherent units that fulfill specific purposes. Though the task of automatic web page segmentation was introduced two decades ago, along with several applications in web content analysis, its foundations are still lacking. Specifically, the developed evaluation methods and datasets presume a certain downstream task, which led to a variety of incompatible datasets and evaluation methods. To address this shortcoming, we contribute two resources: (1) An evaluation framework which can be adjusted to downstream tasks by measuring the segmentation similarity regarding visual, structural, and textual elements, and which includes measures for annotator agreement, segmentation quality, and an algorithm for segmentation fusion. (2) The Webis-WebSeg-20 dataset, comprising 42,450 crowdsourced segmentations for 8,490 web pages, outranging existing sources by an order of magnitude. Our results help to better understand the "mental segmentation model" of human annotators: Among other things we find that annotators mostly agree on segmentations for all kinds of web page elements (visual, structural, and textual). Disagreement exists mostly regarding the right level of granularity, indicating a general agreement on the visual structure of web pages.
Pages: 3047-3054
Page count: 8
Related papers
50 in total
  • [21] An IR-based Evaluation Framework for Web Search Query Segmentation
    Roy, Rishiraj Saha
    Ganguly, Niloy
    Choudhury, Monojit
    Laxman, Srivatsan
    SIGIR 2012: PROCEEDINGS OF THE 35TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2012, : 881 - 890
  • [22] Behavior based web page evaluation
    Velayathan, Ganesan
    Yamada, Seiji
    JOURNAL OF WEB ENGINEERING, 2007, 6 (03): : 222 - 243
  • [23] Behavior based web page evaluation
    Graduate University for Advanced Studies, National Institute of Informatics , 2-1-2 Hitotsubashi, Chiyoda, 101-8430 Tokyo, Japan
    Unknown
    Int. World Wide Web Conf., (1317-1318):
  • [24] A segmentation method for web page analysis using shrinking and dividing
    Cao, Jiuxin
    Mao, Bo
    Luo, Junzhou
    INTERNATIONAL JOURNAL OF PARALLEL EMERGENT AND DISTRIBUTED SYSTEMS, 2010, 25 (02) : 93 - 104
  • [25] Accelerating the process of web page segmentation via template clustering
    Zeleny J.
    Burget R.
    International Journal of Intelligent Information and Database Systems, 2016, 9 (02) : 134 - 154
  • [26] Web Page Segmentation Based on the Hough transform and Vision Cues
    Wei, Tingting
    Lu, Yonghe
    Li, Xuanjie
    Liu, Jinglun
    2015 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2015, : 865 - 872
  • [27] A Block Gathering Based on Mobile Web Page Segmentation Algorithm
    Wu, Libing
    Ke, Yalin
    He, Yanxiang
    Liu, Nan
    TRUSTCOM 2011: 2011 INTERNATIONAL JOINT CONFERENCE OF IEEE TRUSTCOM-11/IEEE ICESS-11/FCST-11, 2011, : 1425 - 1430
  • [28] A web page segmentation algorithm based on Iterated Dividing and Shrinking
    Cao Jiuxin
    Mao Bo
    Luo Junzhou
    2007 IFIP INTERNATIONAL CONFERENCE ON NETWORK AND PARALLEL COMPUTING WORKSHOPS, PROCEEDINGS, 2007, : 701 - 705
  • [29] Proposal of Seam Degree and Content Similarity for Web Page Segmentation
    Zeng, Jun
    Flanagan, Brendan
    Xiong, Qingyu
    Wen, Junhao
    Hirokawa, Sachio
    2013 SECOND IIAI INTERNATIONAL CONFERENCE ON ADVANCED APPLIED INFORMATICS (IIAI-AAI 2013), 2013, : 9 - 14
  • [30] Toward semantic annotation of Web page's segmentation blocks
    Cosulschi, Mirel
    ANNALS OF THE UNIVERSITY OF CRAIOVA-MATHEMATICS AND COMPUTER SCIENCE SERIES, 2010, 37 (03): : 92 - 100