A survey of historical document image datasets

Cited by: 10
Authors
Nikolaidou, Konstantina [1]
Seuret, Mathias [2]
Mokayed, Hamam [1]
Liwicki, Marcus [1]
Affiliations
[1] Lulea Univ Technol, EISLAB Machine Learning Grp, Aurorum 1, S-97187 Lulea, Norrbotten, Sweden
[2] Friedrich Alexander Univ, Pattern Recognit Lab Comp Vis Grp, Martensstr 3, D-91058 Erlangen, Bavaria, Germany
Keywords
Historical documents; Image datasets; Document image analysis; Machine learning; HANDWRITTEN TEXT RECOGNITION; ICFHR; 2018; COMPETITION; HIDDEN MARKOV-MODELS; WRITER IDENTIFICATION; SEGMENTATION; LINE; SYSTEM; BINARIZATION; FEATURES; EXTRACTION;
DOI
10.1007/s10032-022-00405-8
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents a systematic literature review of image datasets for document image analysis, focusing on historical documents, such as handwritten manuscripts and early prints. Finding appropriate datasets for historical document analysis is a crucial prerequisite to facilitate research using different machine learning algorithms. However, because of the very large variety of the actual data (e.g., scripts, tasks, dates, support systems, and amount of deterioration), the different formats for data and label representation, and the different evaluation processes and benchmarks, finding appropriate datasets is a difficult task. This work fills this gap, presenting a meta-study on existing datasets. After a systematic selection process (according to PRISMA guidelines), we select 65 studies that are chosen based on different factors, such as the year of publication, number of methods implemented in the article, reliability of the chosen algorithms, dataset size, and journal outlet. We summarize each study by assigning it to one of three pre-defined tasks: document classification, layout structure, or content analysis. We present the statistics, document type, language, tasks, input visual aspects, and ground truth information for every dataset. In addition, we provide the benchmark tasks and results from these papers or recent competitions. We further discuss gaps and challenges in this domain. We advocate for providing conversion tools to common formats (e.g., COCO format for computer vision tasks) and always providing a set of evaluation metrics, instead of just one, to make results comparable across studies.
Pages: 305-338
Page count: 34
Related papers
50 in total
  • [1] A survey of historical document image datasets
    Konstantina Nikolaidou
    Mathias Seuret
    Hamam Mokayed
    Marcus Liwicki
    [J]. International Journal on Document Analysis and Recognition (IJDAR), 2022, 25 : 305 - 338
  • [2] Historical document image binarization
    Mello, Carlos A. B.
    Oliveira, Adriano L. I.
    Sanchez, Angel
    [J]. VISAPP 2008: PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCE ON COMPUTER VISION THEORY AND APPLICATIONS, VOL 1, 2008, : 108 - 113
  • [3] Historical Document Image Binarization: A Review
    Tensmeyer C.
    Martinez T.
    [J]. SN Computer Science, 2020, 1 (3)
  • [4] Document Image Retrieval: A Survey
    Tursun, Gulzira
    Aysa, Yunus
    Amrulla, Guzalnur
    Aysa, Alimjan
    Ubul, Kurban
    [J]. INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND COMMUNICATION ENGINEERING (CSCE 2015), 2015, : 1317 - 1324
  • [5] An Empirical Survey on Long Document Summarization: Datasets, Models, and Metrics
    Koh, Huan Yee
    Ju, Jiaxin
    Liu, Ming
    Pan, Shirui
    [J]. ACM COMPUTING SURVEYS, 2023, 55 (08)
  • [6] A Hybrid Method for Historical Degraded Document Image
    Naouel Ouafek
    Mohamed-Khireddine Kholladi
    [J]. 2018 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND INTELLIGENT SYSTEMS (CIIS 2018), 2018, : 66 - 70
  • [7] A Survey on Document Image Binarization Techniques
    Lokhande, Supriya Sunil
    Dawande, N. A.
    [J]. 1ST INTERNATIONAL CONFERENCE ON COMPUTING COMMUNICATION CONTROL AND AUTOMATION ICCUBEA 2015, 2015, : 742 - 746
  • [8] Document image analysis and recognition: a survey
    Arlazarov, V. V.
    Andreeva, E. I.
    Bulatov, K. B.
    Nikolaev, D. P.
    Petrova, O. O.
    Savelev, B. I.
    Slavin, O. A.
    [J]. COMPUTER OPTICS, 2022, 46 (04) : 567 - 589
  • [9] Historical Document Image Denoising by Ising Model
    Chen, Guoming
    Chen, Qiang
    Chen, Yiqun
    Zhu, Xiongyong
    [J]. 2020 IEEE INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, INTL CONF ON CLOUD AND BIG DATA COMPUTING, INTL CONF ON CYBER SCIENCE AND TECHNOLOGY CONGRESS (DASC/PICOM/CBDCOM/CYBERSCITECH), 2020, : 457 - 461
  • [10] Document Image Quality Assessment: A Survey
    Alaei, Alireza
    Bui, Vinh
    Doermann, David
    Pal, Umapada
    [J]. ACM COMPUTING SURVEYS, 2024, 56 (02)