Hybrid model for Chinese character recognition based on Tesseract-OCR

Cited by: 5
Authors
Wang, Bo [1]
Ma, Yi-Wei [2, 3]
Hu, Hong-Tao [1]
Affiliations
[1] Shanghai Maritime Univ, Logist Engn College, Pudong New Area, Shanghai, Peoples R China
[2] Natl Taiwan Univ Sci & Technol, Dept Elect Engn, Taipei, Taiwan
[3] Shanghai Maritime Univ, China Inst FTZ Supply Chain, Pudong New Area, Shanghai, Peoples R China
Keywords
hybrid model; image processing; Chinese character; optical character recognition; OCR; phrase processing; K-nearest neighbour; KNN; Tesseract-OCR; single char recognition
DOI
10.1504/IJIPT.2020.106316
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Optical character recognition (OCR) is an important way to input information into a computer, since it extracts text information from images. The accuracy of Chinese OCR still has room for improvement. This study proposes a hybrid Chinese character recognition model based on the characteristics of Chinese. Before the OCR engine runs, the model first filters interference information out of the image and then adjusts the aspect ratio of the characters. After the OCR engine processes the image, a single-character recognition result is obtained; this result is then checked and corrected at the phrase level. The experimental results show that the hybrid model improves the accuracy of Chinese OCR: image processing increases the recognition accuracy of the Tesseract-OCR engine by about 12%, and natural language processing improves the accuracy of the recognition result by about 5%.
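The phrase-level check mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the correction is actually performed, so everything here is an assumption made for illustration: the tiny `PHRASES` lexicon, the `correct_phrase` helper, and the idea that the OCR stage yields a ranked list of candidate characters per position are all hypothetical and are not the authors' method.

```python
from itertools import product

# Hypothetical phrase lexicon (assumption); a real system would load a
# much larger dictionary of Chinese words and phrases.
PHRASES = {"上海", "海事", "大学", "识别"}


def correct_phrase(candidates, lexicon):
    """Choose one character per position so that the result is a known phrase.

    candidates: list of lists; candidates[i] holds the OCR guesses for
    position i, ordered best guess first (an assumed input format).
    Returns the first combination found in the lexicon, or the top-1
    guesses joined together if no combination matches.
    """
    top1 = "".join(chars[0] for chars in candidates)
    if top1 in lexicon:
        return top1
    # Try every combination of candidate characters against the lexicon.
    for combo in product(*candidates):
        word = "".join(combo)
        if word in lexicon:
            return word
    return top1


# The top-1 guesses "土海" are not a phrase, but "上海" is.
print(correct_phrase([["土", "上"], ["海", "悔"]], PHRASES))
```

The exhaustive search is only practical for short phrases with a few candidates per character; it falls back to the raw single-character result when nothing in the lexicon matches.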
Pages: 102-108
Page count: 7