Improving OCR for Historical Documents by Modeling Image Distortion

Cited by: 3

Authors:

Maekawa, Keiya [1]

Tomiura, Yoichi [1]

Fukuda, Satoshi [1]

Ishita, Emi [1]

Uchiyama, Hideaki [1]

Affiliation:

[1] Kyushu Univ, Nishi Ku, Fukuoka, Japan

Source:

DIGITAL LIBRARIES AT THE CROSSROADS OF DIGITAL INFORMATION FOR THE FUTURE, ICADL 2019 | 2019, Vol. 11853

Keywords:

OCR error; Information retrieval; Historical document image

DOI:

10.1007/978-3-030-34058-2_31

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Archives hold printed historical documents, many of which have deteriorated. It is difficult to extract text from such images without errors using optical character recognition (OCR). This problem reduces the accuracy of information retrieval. Therefore, it is necessary to improve the performance of OCR for images of deteriorated documents. One approach is to convert images of deteriorated documents to clear images, to make it easier for an OCR system to recognize text. To perform this conversion using a neural network, data is needed to train it. It is hard to prepare training data consisting of pairs of a deteriorated image and an image from which deterioration has been removed; however, it is easy to prepare training data consisting of pairs of a clear image and an image created by adding noise to it. In this study, PDFs of historical documents were collected and converted to text and JPEG images. Noise was added to the JPEG images to create a dataset in which the images had noise similar to that of the actual printed documents. U-Net, a type of neural network, was trained using this dataset. The performance of OCR for an image with noise in the test data was compared with the performance of OCR for an image generated from it by the trained U-Net. An improvement in the OCR recognition rate was confirmed.
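The pairing idea in the abstract (take a clear image, add noise, and use the noisy/clear pair as training data) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the abstract does not say which kinds of noise were used, so the fading, box blur, and speckle steps, the function names `add_synthetic_noise` and `make_training_pairs`, and all parameter values are assumptions. Images are represented as plain nested lists of grayscale values in [0, 1] (1.0 = paper, 0.0 = ink) to keep the example dependency-free, and the U-Net training itself is not shown.

```python
import random


def add_synthetic_noise(clean, rng, fade=0.15, speckle_prob=0.02):
    """Return a degraded copy of a grayscale image (hypothetical noise model).

    clean: list of rows, each a list of floats in [0, 1].
    rng:   a random.Random instance, so results are reproducible.
    """
    h, w = len(clean), len(clean[0])

    # Assumed step 1: fading, which pulls ink values toward paper white.
    faded = [[p + fade * (1.0 - p) for p in row] for row in clean]

    # Assumed step 2: a 3x3 box blur with clamped borders, to soften edges.
    noisy = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += faded[yy][xx]
            row.append(min(max(total / 9.0, 0.0), 1.0))
        noisy.append(row)

    # Assumed step 3: salt-and-pepper speckle, imitating stains and dropouts.
    for y in range(h):
        for x in range(w):
            r = rng.random()
            if r < speckle_prob / 2:
                noisy[y][x] = 0.0
            elif r > 1.0 - speckle_prob / 2:
                noisy[y][x] = 1.0
    return noisy


def make_training_pairs(clean_images, seed=0):
    """Build (noisy input, clean target) pairs for a denoising network."""
    rng = random.Random(seed)
    return [(add_synthetic_noise(img, rng), img) for img in clean_images]
```

A network trained on such pairs takes the noisy image as input and the clear image as its target. How well this transfers to real documents would depend on how closely the synthetic noise resembles the actual deterioration, which is the part this sketch only guesses at.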


Pages: 312-316

Number of pages: 5
