ICDAR2017 Competition on Post-OCR Text Correction

Cited by: 20
Authors
Chiron, Guillaume [1]
Doucet, Antoine [2]
Coustaty, Mickael [2]
Moreux, Jean-Philippe [1]
Affiliations
[1] Natl Lib France, F-75706 Paris, France
[2] Univ La Rochelle, Lab L3i, Av Michel Crepeau, F-17000 La Rochelle, France
DOI
10.1109/ICDAR.2017.232
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper describes the ICDAR2017 competition on post-OCR text correction and presents the methods submitted by the participants. OCR has been an active research field for more than 30 years, but results remain imperfect, especially for historical documents. The purpose of the competition is to compare and evaluate automatic approaches for correcting (denoising) OCR-ed texts. The challenge consists of two independent tasks: 1) error detection and 2) error correction. An original dataset of 12M OCR-ed symbols, along with an aligned ground truth, was provided to the participants, with 80% of the dataset dedicated to training and 20% to evaluation. Several sources were aggregated; they notably include newspapers and monographs covering two languages (English and French). Eleven teams submitted results, and the difficulty of the task is underlined by the fact that only half of the submitted methods were able to denoise the evaluation dataset on average. In any case, this competition, which counted 35 registrations, illustrates the strong interest of the community in this essential problem, which is key to any digitization process involving textual data.
Pages: 1423-1428
Number of pages: 6