Statistical learning for OCR error correction

Cited by: 19
Authors
Mei, Jie [1]
Islam, Aminul [2]
Moh'd, Abidalrahman [1]
Wu, Yajing [1]
Milios, Evangelos [1]
Affiliations
[1] Dalhousie Univ, Fac Comp Sci, Halifax, NS B3H 1W5, Canada
[2] Univ Louisiana Lafayette, Sch Comp & Informat, Lafayette, LA 70503 USA
Keywords
OCR post-processing; OCR error; Error correction; Statistical learning
DOI
10.1016/j.ipm.2018.06.001
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Modern OCR engines incorporate some form of error correction, typically based on dictionaries. However, there are still residual errors that decrease performance of natural language processing algorithms applied to OCR text. In this paper, we present a statistical learning model for post processing OCR errors, either in a fully automatic manner or followed by minimal user interaction to further reduce error rate. Our model employs web-scale corpora and integrates a rich set of linguistic features. Through an interdependent learning pipeline, our model produces and continuously refines the error detection and suggestion of candidate corrections. Evaluated on a historical biology book with complex error patterns, our model outperforms various baseline methods in the automatic mode and shows an even greater advantage when involving minimal user interaction. Quantitative analysis of each computational step further suggests that our proposed model is well-suited for handling volatile and complex OCR error patterns, which are beyond the capabilities of error correction incorporated in OCR engines.
Pages: 874-887
Page count: 14
Related papers
50 in total
  • [1] OCR Error Correction Using BiLSTM
    Kayabas, Ayla
    Topcu, Ahmet E.
    Kilic, Ozkan
    [J]. INTERNATIONAL CONFERENCE ON ELECTRICAL, COMPUTER AND ENERGY TECHNOLOGIES (ICECET 2021), 2021, : 2083 - 2087
  • [2] An Efficient Unsupervised Approach for OCR Error Correction of Vietnamese OCR Text
    Nguyen, Quoc-Dung
    Phan, Nguyet-Minh
    Kromer, Pavel
    Le, Duc-Anh
    [J]. IEEE ACCESS, 2023, 11 : 58406 - 58421
  • [3] OCR Error Correction for Vietnamese OCR Text with Different Edit Distances
    Quoc-Dung Nguyen
    Nguyet-Minh Phan
    Kromer, Pavel
    [J]. ADVANCES IN INTELLIGENT NETWORKING AND COLLABORATIVE SYSTEMS, INCOS-2022, 2022, 527 : 130 - 139
  • [4] Effect of OCR error correction on Arabic retrieval
    Magdy, Walid
    Darwish, Kareem
    [J]. INFORMATION RETRIEVAL, 2008, 11 (05): 405 - 425
  • [5] Effect of OCR error correction on Arabic retrieval
    Walid Magdy
    Kareem Darwish
    [J]. Information Retrieval, 2008, 11 : 405 - 425
  • [6] Thai OCR error correction using genetic algorithm
    Kruatrachue, B
    Somguntar, K
    Siriboon, K
    [J]. FIRST INTERNATIONAL SYMPOSIUM ON CYBER WORLDS, PROCEEDINGS, 2002, : 137 - 141
  • [7] Using SMT for OCR Error Correction of Historical Texts
    Afli, Haithem
    Qiu, Zhengwei
    Way, Andy
    Sheridan, Paraic
    [J]. LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : 962 - 966
  • [8] OCR Error Correction for Unconstrained Vietnamese Handwritten Text
    Nguyen, Quoc-Dung
    Le, Duc-Anh
    Zelinka, Ivan
    [J]. SOICT 2019: PROCEEDINGS OF THE TENTH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY, 2019, : 132 - 138
  • [9] Thai OCR error correction using token passing algorithm
    Rodphon, M
    Siriboon, K
    Kruatrachue, B
    [J]. 2001 IEEE PACIFIC RIM CONFERENCE ON COMMUNICATIONS, COMPUTERS AND SIGNAL PROCESSING, VOLS I AND II, CONFERENCE PROCEEDINGS, 2001, : 599 - 602
  • [10] Progress of combining trigram and Winnow in Thai OCR error correction
    Meknavin, S
    Kijsirikul, B
    Chotimongkol, A
    Nuttee, C
    [J]. APCCAS '98 - IEEE ASIA-PACIFIC CONFERENCE ON CIRCUITS AND SYSTEMS: MICROELECTRONICS AND INTEGRATING SYSTEMS, 1998, : 555 - 558