Hybrid model for Chinese character recognition based on Tesseract-OCR

Cited by: 5
Authors
Wang, Bo [1]
Ma, Yi-Wei [2, 3]
Hu, Hong-Tao [1]
Affiliations
[1] Shanghai Maritime Univ, Logist Engn College, Pudong New Area, Shanghai, Peoples R China
[2] Natl Taiwan Univ Sci & Technol, Dept Elect Engn, Taipei, Taiwan
[3] Shanghai Maritime Univ, China Inst FTZ Supply Chain, Pudong New Area, Shanghai, Peoples R China
Keywords
hybrid model; image processing; Chinese character; optical character recognition; OCR; phrase processing; K-nearest neighbour; KNN; Tesseract-OCR; single char recognition
DOI
10.1504/IJIPT.2020.106316
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Optical character recognition (OCR) is an important way to enter information into a computer: it extracts text from an image. The accuracy of Chinese OCR still has room for improvement. This study proposes a hybrid Chinese character recognition model based on the characteristics of the Chinese language. Before the OCR engine runs, the model filters interference from the image and then adjusts the aspect ratio of the characters. The OCR engine then produces single-character recognition results, which are checked and corrected at the phrase level. Experimental results show that the hybrid model improves the accuracy of Chinese OCR: image processing increases the recognition accuracy of the Tesseract-OCR engine by about 12%, and natural language processing improves the accuracy of the recognition result by about 5%.
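The phrase-level check mentioned in the abstract could look roughly like the sketch below. This is an illustration only and not the authors' implementation: the lexicon `PHRASES`, the function `correct_phrase`, and the rank-sum cost are all hypothetical, and the sketch assumes the OCR stage can return several ranked candidates per character.

```python
from itertools import product

# Hypothetical phrase lexicon; a real system would load a large dictionary.
PHRASES = {"上海", "海事", "大学"}


def correct_phrase(candidates, lexicon):
    """Pick a phrase from per-character OCR candidates.

    candidates: one list per character position, ordered best-first.
    Returns the lexicon phrase whose candidate ranks sum to the lowest
    value, or the top-1 string unchanged if no combination is a known phrase.
    """
    top1 = "".join(c[0] for c in candidates)
    if top1 in lexicon:
        return top1

    best, best_cost = top1, None
    # Enumerate every combination of candidate ranks (fine for short phrases).
    for ranks in product(*(range(len(c)) for c in candidates)):
        word = "".join(c[r] for c, r in zip(candidates, ranks))
        cost = sum(ranks)  # lower means closer to the OCR engine's first choices
        if word in lexicon and (best_cost is None or cost < best_cost):
            best, best_cost = word, cost
    return best


if __name__ == "__main__":
    # The top-1 result "土海" is not a known phrase; the second candidate
    # for the first character yields "上海", which is in the lexicon.
    print(correct_phrase([["土", "上"], ["海", "悔"]], PHRASES))
```

Exhaustive enumeration is used here for clarity; for longer phrases or larger candidate lists, a dynamic-programming search over a language model would be a more practical choice.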
Pages: 102-108
Number of pages: 7
Related papers
50 in total
  • [21] Applying SIMD to optical character recognition (OCR)
    Yu, Guan
    Gauthier, Lafruit
    Stahl, Richard
    Corporaal, Henk
    Schelkens, Peter
    [J]. OPTICAL AND DIGITAL IMAGE PROCESSING, 2008, 7000
  • [22] How to Improve Optical Character Recognition of Historical Finnish Newspapers Using Open Source Tesseract OCR Engine - Final Notes on Development and Evaluation
    Koistinen, Mika
    Kettunen, Kimmo
    Kervinen, Jukka
    [J]. HUMAN LANGUAGE TECHNOLOGY. CHALLENGES FOR COMPUTER SCIENCE AND LINGUISTICS, LTC 2017, 2020, 12598 : 17 - 30
  • [23] A new hybrid methodology for intelligent Chinese character recognition
    Al-Dabass, D
    Evans, D
    Ren, ML
    [J]. HIS'04: Fourth International Conference on Hybrid Intelligent Systems, Proceedings, 2005, : 104 - 109
  • [24] A hybrid post-processing system for offline handwritten Chinese character recognition based on a statistical language model
    Xu, RF
    Yeung, DS
    Shi, DM
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2005, 19 (03) : 415 - 428
  • [25] Offline handwritten Chinese character recognition based on DBN fusion model
    Liu, Lu
    Sun, Weiwei
    Ding, Bo
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON INFORMATION AND AUTOMATION (ICIA), 2016, : 1807 - 1811
  • [26] When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts
    Novotny, Vit
    [J]. RECENT ADVANCES IN SLAVONIC NATURAL LANGUAGE PROCESSING (RASLAN 2020), 2020, : 3 - 12
  • [27] Optical Character Recognition (OCR) Performance in Server-based Mobile Environment
    Mantoro, Teddy
    Sobri, Abdul Muis
    Usino, Wendi
    [J]. 2013 International Conference on Advanced Computer Science Applications and Technologies (ACSAT), 2014, : 423 - 428
  • [28] Lexicon Reduction for Urdu/Arabic Script Based Character Recognition: A Multilingual OCR
    Naz, Saeeda
    Umar, Arif Iqbal
    Razzak, Muhammad Imran
    [J]. MEHRAN UNIVERSITY RESEARCH JOURNAL OF ENGINEERING AND TECHNOLOGY, 2016, 35 (02) : 209 - 216
  • [29] Shape decomposition-based handwritten compound character recognition for Bangla OCR
    Pramanik, Rahul
    Bag, Soumen
    [J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2018, 50 : 123 - 134
  • [30] Hybrid GMDH Model for Handwritten Character Recognition
    Dhawan, Parag
    Dongre, Snehlata
    Tidke, D. J.
    [J]. 2013 IEEE INTERNATIONAL MULTI CONFERENCE ON AUTOMATION, COMPUTING, COMMUNICATION, CONTROL AND COMPRESSED SENSING (IMAC4S), 2013, : 698 - 703