Improving OCR for Historical Documents by Modeling Image Distortion

Cited by: 3
Authors
Maekawa, Keiya [1]
Tomiura, Yoichi [1]
Fukuda, Satoshi [1]
Ishita, Emi [1]
Uchiyama, Hideaki [1]
Affiliation
[1] Kyushu Univ, Nishi Ku, Fukuoka, Japan
Keywords
OCR error; Information retrieval; Historical document image;
DOI
10.1007/978-3-030-34058-2_31
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Archives hold printed historical documents, many of which have deteriorated. It is difficult to extract text from such images without errors using optical character recognition (OCR). This problem reduces the accuracy of information retrieval. Therefore, it is necessary to improve the performance of OCR for images of deteriorated documents. One approach is to convert images of deteriorated documents to clear images, to make it easier for an OCR system to recognize text. To perform this conversion using a neural network, data is needed to train it. It is hard to prepare training data consisting of pairs of a deteriorated image and an image from which deterioration has been removed; however, it is easy to prepare training data consisting of pairs of a clear image and an image created by adding noise to it. In this study, PDFs of historical documents were collected and converted to text and JPEG images. Noise was added to the JPEG images to create a dataset in which the images had noise similar to that of the actual printed documents. U-Net, a type of neural network, was trained using this dataset. The performance of OCR for an image with noise in the test data was compared with the performance of OCR for an image generated from it by the trained U-Net. An improvement in the OCR recognition rate was confirmed.
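The pairing step described in the abstract (a clear image plus a copy with noise added) can be sketched as follows. This is a minimal illustration, not the authors' method: the abstract does not specify the noise model, so the Gaussian noise and dark specks used here, the function names `add_synthetic_noise` and `make_training_pair`, and all parameter values are assumptions chosen only to show the idea.

```python
import numpy as np


def add_synthetic_noise(clean, sigma=0.08, speckle_fraction=0.02, seed=0):
    """Return a degraded copy of `clean`, a 2-D float array with values in [0, 1].

    Assumed noise model (not taken from the paper): additive Gaussian noise
    plus randomly placed dark specks.
    """
    rng = np.random.default_rng(seed)
    # Additive Gaussian noise; the addition allocates a new array,
    # so the clean image is left untouched.
    noisy = clean + rng.normal(0.0, sigma, size=clean.shape)
    # Set a random fraction of pixels to black to imitate ink specks.
    speck_mask = rng.random(clean.shape) < speckle_fraction
    noisy[speck_mask] = 0.0
    # Keep pixel values in the valid range.
    return np.clip(noisy, 0.0, 1.0)


def make_training_pair(clean, **noise_kwargs):
    """Return (input, target): the noisy image and the clean image it came from."""
    return add_synthetic_noise(clean, **noise_kwargs), clean


# Stand-in for a rendered page: white background with one dark "text" bar.
page = np.ones((64, 64))
page[20:24, 8:56] = 0.0

noisy, target = make_training_pair(page)
```

A denoising network such as U-Net would then be trained to map `noisy` to `target`; in practice the clean arrays would come from the page images rendered from the collected PDFs rather than from a synthetic bar.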
Pages: 312-316
Page count: 5