WooIR: A New Open Page Stream Segmentation Dataset

Cited by: 2
Authors
van Heusden, Ruben [1]
Kamps, Jaap [1]
Marx, Maarten [1]
Affiliations
[1] Univ Amsterdam, Amsterdam, Netherlands
Keywords
Page Stream Segmentation; Text classification; Clustering; Metrics; Benchmark; TEXT; CLASSIFICATION
DOI
10.1145/3539813.3545150
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this work, we present WooIR, an open, realistic benchmark for Page Stream Segmentation (PSS), the task of recovering document boundaries from aggregated streams of pages. The dataset comprises over 200 streams of scanned documents, totaling 7K documents, 45K pages, and 10M words, originating from documents released by the Dutch government in response to requests made under the Freedom of Information Act. In addition to introducing the dataset, we run several baseline experiments on it and compare six metrics for the PSS task, aiming to unify the field around evaluation metrics better suited to the task. Analysis of the six metrics on WooIR shows that the dataset contains a good balance of easy and hard samples. The Panoptic Quality metric, borrowed from the image segmentation field, appears to be the most appropriate evaluation metric for the PSS task.
Pages: 165-174
Page count: 10
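
The abstract defines PSS as recovering document boundaries from a stream of pages and points to Panoptic Quality as a fitting metric. The sketch below illustrates one way that metric could be computed for a single stream. It is not the authors' implementation. It assumes, for illustration only, that a stream is encoded as a binary list in which 1 marks the first page of a document, and it uses the standard Panoptic Quality definition from image segmentation: predicted and gold segments match when their intersection over union (IoU) exceeds 0.5, and the score is the sum of matched IoUs divided by TP + 0.5 FP + 0.5 FN. The function names are hypothetical.

```python
from typing import List, Set


def boundaries_to_segments(starts: List[int]) -> List[Set[int]]:
    """Convert a binary vector (1 = page starts a new document) into sets of page indices.

    Assumed encoding for illustration; the first page always opens a segment.
    """
    segments: List[Set[int]] = []
    for page, is_start in enumerate(starts):
        if is_start or not segments:
            segments.append(set())
        segments[-1].add(page)
    return segments


def panoptic_quality(gold: List[int], pred: List[int]) -> float:
    """Panoptic Quality between a gold and a predicted segmentation of one stream."""
    gold_segs = boundaries_to_segments(gold)
    pred_segs = boundaries_to_segments(pred)

    iou_sum = 0.0
    true_pos = 0
    for g in gold_segs:
        for p in pred_segs:
            iou = len(g & p) / len(g | p)
            # IoU > 0.5 guarantees that a segment has at most one match.
            if iou > 0.5:
                iou_sum += iou
                true_pos += 1
                break

    false_pos = len(pred_segs) - true_pos  # predicted documents without a match
    false_neg = len(gold_segs) - true_pos  # gold documents without a match
    denom = true_pos + 0.5 * false_pos + 0.5 * false_neg
    return iou_sum / denom if denom else 0.0


if __name__ == "__main__":
    gold = [1, 0, 0, 1, 0, 1]  # documents: pages 0-2, 3-4, 5
    pred = [1, 0, 1, 1, 0, 1]  # the first document is wrongly split after page 1
    print(round(panoptic_quality(gold, pred), 3))  # 16/21, about 0.762
```

In this toy stream, the wrongly split first document still counts as a match (IoU 2/3) but lowers the score, and the spurious single-page segment adds a false positive, so the metric penalises both imprecise boundaries and extra documents.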