Hybrid model for Chinese character recognition based on Tesseract-OCR

Cited by: 5
Authors
Wang, Bo [1 ]
Ma, Yi-Wei [2 ,3 ]
Hu, Hong-Tao [1 ]
Affiliations
[1] Shanghai Maritime Univ, Logist Engn College, Pudong New Area, Shanghai, Peoples R China
[2] Natl Taiwan Univ Sci & Technol, Dept Elect Engn, Taipei, Taiwan
[3] Shanghai Maritime Univ, China Inst FTZ Supply Chain, Pudong New Area, Shanghai, Peoples R China
Keywords
hybrid model; image processing; Chinese character; optical character recognition; OCR; phrase processing; K-nearest neighbour; KNN; Tesseract-OCR; single char recognition
DOI
10.1504/IJIPT.2020.106316
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Optical character recognition (OCR) is an important way to enter information into a computer: it extracts text from an image. The accuracy of Chinese OCR still has room for improvement. This study proposes a hybrid Chinese character recognition model based on the characteristics of the Chinese language. Before the OCR engine runs, the model filters interference out of the image and then adjusts the aspect ratio of the characters. The OCR engine then produces single-character recognition results, which are checked and corrected at the phrase level. Experimental results show that the hybrid model improves the accuracy of Chinese OCR: image processing raises the recognition accuracy of the Tesseract-OCR engine by about 12%, and natural language processing improves the accuracy of the recognition result by about 5%.
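The abstract names the stages of the pipeline but not the specific filters or the correction algorithm. The sketch below is one possible reading of those stages, not the authors' implementation. Every function name is hypothetical, the image is a plain list of grayscale rows, the speck filter and nearest-neighbour resize are stand-ins for the paper's unspecified image processing, and the lexicon lookup is a stand-in for its phrase-level check. The call to the Tesseract-OCR engine itself is left out; the correction step assumes the engine (or a wrapper around it) can supply ranked candidate characters for each position.

```python
from typing import List, Sequence, Set

Image = List[List[int]]  # grayscale rows, 0 = black ink, 255 = white background


def binarize(img: Image, threshold: int = 128) -> Image:
    """Map every pixel to 0 (ink) or 255 (background)."""
    return [[0 if p < threshold else 255 for p in row] for row in img]


def remove_specks(img: Image, min_neighbours: int = 1) -> Image:
    """Clear ink pixels that have fewer than min_neighbours ink pixels
    among their 8 neighbours (a crude interference filter)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 0:
                continue
            ink = sum(
                1
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy or dx)
                and 0 <= y + dy < h
                and 0 <= x + dx < w
                and img[y + dy][x + dx] == 0
            )
            if ink < min_neighbours:
                out[y][x] = 255
    return out


def resize_nearest(img: Image, new_h: int, new_w: int) -> Image:
    """Nearest-neighbour resize, used here to change a character's aspect ratio."""
    h, w = len(img), len(img[0])
    return [
        [img[y * h // new_h][x * w // new_w] for x in range(new_w)]
        for y in range(new_h)
    ]


def correct_phrase(candidates: Sequence[Sequence[str]], lexicon: Set[str]) -> str:
    """candidates[i] lists the candidate characters for position i, best first.
    Return the lexicon phrase of matching length whose characters all appear
    among the candidates with the lowest summed rank; if none fits, return
    the top-1 string unchanged."""
    top1 = "".join(c[0] for c in candidates)
    if top1 in lexicon:
        return top1
    best, best_cost = top1, None
    for phrase in sorted(lexicon):  # sorted for deterministic tie-breaking
        if len(phrase) != len(candidates):
            continue
        cost = 0
        for ch, cands in zip(phrase, candidates):
            if ch not in cands:
                break
            cost += list(cands).index(ch)
        else:
            if best_cost is None or cost < best_cost:
                best, best_cost = phrase, cost
    return best
```

For example, if the engine's top guess for a two-character phrase is "土海" but "上" is its second candidate for the first position, `correct_phrase([["土", "上"], ["海", "悔"]], {"上海", "北京"})` selects "上海" because it is the only lexicon entry consistent with the candidates.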
Pages: 102-108
Page count: 7