Improving OCR for Historical Documents by Modeling Image Distortion

Cited by: 3
Authors
Maekawa, Keiya [1]
Tomiura, Yoichi [1]
Fukuda, Satoshi [1]
Ishita, Emi [1]
Uchiyama, Hideaki [1]
Affiliations
[1] Kyushu University, Nishi-ku, Fukuoka, Japan
Keywords
OCR error; Information retrieval; Historical document image
DOI
10.1007/978-3-030-34058-2_31
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Archives hold printed historical documents, many of which have deteriorated. It is difficult to extract text from such images without errors using optical character recognition (OCR). This problem reduces the accuracy of information retrieval. Therefore, it is necessary to improve the performance of OCR for images of deteriorated documents. One approach is to convert images of deteriorated documents to clear images, to make it easier for an OCR system to recognize text. To perform this conversion using a neural network, data is needed to train it. It is hard to prepare training data consisting of pairs of a deteriorated image and an image from which deterioration has been removed; however, it is easy to prepare training data consisting of pairs of a clear image and an image created by adding noise to it. In this study, PDFs of historical documents were collected and converted to text and JPEG images. Noise was added to the JPEG images to create a dataset in which the images had noise similar to that of the actual printed documents. U-Net, a type of neural network, was trained using this dataset. The performance of OCR for an image with noise in the test data was compared with the performance of OCR for an image generated from it by the trained U-Net. An improvement in the OCR recognition rate was confirmed.
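The data-preparation and evaluation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which noise model was used, so salt-and-pepper noise is an assumption here, and the recognition rate is assumed to be one minus the character-level edit distance divided by the reference length. The function names `add_salt_pepper_noise`, `edit_distance`, and `recognition_rate` are hypothetical, and the U-Net training step itself is omitted.

```python
import random


def add_salt_pepper_noise(image, amount=0.05, seed=None):
    """Return a noisy copy of a grayscale image (list of rows of 0-255 ints).

    Assumption: salt-and-pepper noise stands in for the paper's
    unspecified deterioration model.
    """
    rng = random.Random(seed)
    noisy = [row[:] for row in image]  # copy so the clean image is untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))  # black or white speck
    return noisy


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def recognition_rate(reference, ocr_output):
    """Assumed metric: 1 - edit_distance / len(reference), floored at 0."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    return max(0.0, 1.0 - edit_distance(reference, ocr_output) / len(reference))


# Build one (noisy input, clean target) training pair from a blank 8x8 page.
clean = [[255] * 8 for _ in range(8)]
noisy = add_salt_pepper_noise(clean, amount=0.2, seed=0)
training_pair = (noisy, clean)
```

In such a setup, the recognition rate would be computed once for the OCR output of the noisy image and once for the OCR output of the image restored by the trained network, with the text extracted from the source PDF serving as the reference.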
Pages: 312-316
Page count: 5