Using topic models for OCR correction

Authors
Faisal Farooq
Anurag Bhardwaj
Venu Govindaraju
Affiliations
[1] Siemens Medical Solutions, Image and Knowledge Management
[2] University at Buffalo, Department of Computer Science and Engineering
Keywords
OCR correction; Topic models; Lexicon reduction; Language models; Document categorization; Handwritten documents; Unconstrained handwriting
DOI
Not available
Abstract
Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers perform adequately on constrained handwritten documents, which typically use a restricted vocabulary (lexicon). For unconstrained handwritten documents, however, state-of-the-art word recognition accuracy is still below acceptable limits. The objective of this research is to improve word recognition accuracy on unconstrained handwritten documents by applying a post-processing (OCR correction) technique to the word recognition output. In this paper, we present two methods for this purpose. First, we describe a lexicon-reduction method that categorizes handwritten documents by topic and uses the result to generate smaller topic-specific lexicons, which improve recognition accuracy. Second, we describe a method that uses topic-specific language models and a maximum-entropy topic categorization model to refine the recognition output. We present the relative merits of each method and report results on the publicly available IAM database.
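The lexicon-reduction idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy lexicons, the overlap-count categorizer, and the edit-distance correction step are all assumptions chosen to keep the example self-contained, and the names `TOPIC_LEXICONS`, `categorize`, and `correct` are hypothetical. The sketch guesses a document's topic from noisy recognizer output, then corrects out-of-lexicon words against that topic's smaller lexicon only.

```python
from collections import Counter

# Hypothetical topic-specific lexicons (toy data, not taken from the paper).
TOPIC_LEXICONS = {
    "sports": {"match", "team", "goal", "coach", "score"},
    "finance": {"market", "bank", "stock", "loan", "share"},
}


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def categorize(words, topic_lexicons):
    """Assumed stand-in for a topic classifier: pick the topic whose
    lexicon contains the most of the recognized words."""
    counts = Counter()
    for topic, lexicon in topic_lexicons.items():
        counts[topic] = sum(1 for w in words if w in lexicon)
    return counts.most_common(1)[0][0]


def correct(words, lexicon, max_dist=2):
    """Replace each out-of-lexicon word with its nearest lexicon entry,
    provided that entry is within max_dist edits; otherwise keep the word."""
    corrected = []
    for w in words:
        if w in lexicon:
            corrected.append(w)
            continue
        # sorted() makes tie-breaking deterministic
        best = min(sorted(lexicon), key=lambda c: edit_distance(w, c))
        corrected.append(best if edit_distance(w, best) <= max_dist else w)
    return corrected


if __name__ == "__main__":
    recognized = ["team", "goa1", "c0ach", "scorc", "the"]  # noisy toy output
    topic = categorize(recognized, TOPIC_LEXICONS)
    print(topic, correct(recognized, TOPIC_LEXICONS[topic]))
```

Restricting the candidate set to one topic's lexicon is the point of the sketch: the smaller the lexicon, the fewer near-miss candidates compete for each misrecognized word. A real system would rescore recognizer hypotheses rather than rely on plain edit distance.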
Pages: 153-164
Page count: 11
Related papers
  • [1] Using topic models for OCR correction
    Farooq, Faisal
    Bhardwaj, Anurag
    Govindaraju, Venu
    INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2009, 12 (03) : 153 - 164
  • [2] Context-sensitive error correction: Using topic models to improve OCR
    Wick, Michael L.
    Ross, Michael G.
    Learned-Miller, Erik G.
    ICDAR 2007: NINTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2007, : 1168 - +
  • [3] Evaluating Supervised Topic Models in the Presence of OCR Errors
    Walker, Daniel
    Ringger, Eric
    Seppi, Kevin
    DOCUMENT RECOGNITION AND RETRIEVAL XX, 2013, 8658
  • [4] OCR Error Correction Using BiLSTM
    Kayabas, Ayla
    Topcu, Ahmet E.
    Kilic, Ozkan
    INTERNATIONAL CONFERENCE ON ELECTRICAL, COMPUTER AND ENERGY TECHNOLOGIES (ICECET 2021), 2021, : 2083 - 2087
  • [5] Evaluating the Robustness of Embedding-Based Topic Models to OCR Noise
    Zosa, Elaine
    Mutuvi, Stephen
    Granroth-Wilding, Mark
    Doucet, Antoine
    TOWARDS OPEN AND TRUSTWORTHY DIGITAL SOCIETIES, ICADL 2021, 2021, 13133 : 392 - 400
  • [6] Thai OCR error correction using genetic algorithm
    Kruatrachue, B
    Somguntar, K
    Siriboon, K
    FIRST INTERNATIONAL SYMPOSIUM ON CYBER WORLDS, PROCEEDINGS, 2002, : 137 - 141
  • [7] Using SMT for OCR Error Correction of Historical Texts
    Afli, Haithem
    Qiu, Zhengwei
    Way, Andy
    Sheridan, Paraic
    LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : 962 - 966
  • [8] Videotext OCR using hidden Markov models
    Natarajan, P
    Elmieh, B
    Schwartz, R
    Makhoul, J
    SIXTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, PROCEEDINGS, 2001, : 947 - 951
  • [9] Correction to: Visual topic models for healthcare data clustering
    K. Rajendra Prasad
    Moulana Mohammed
    R. M. Noorullah
    Evolutionary Intelligence, 2021, 14 (2) : 563 - 565
  • [10] Evaluating the Impact of OCR Errors on Topic Modeling
    Mutuvi, Stephen
    Doucet, Antoine
    Odeo, Moses
    Jatowt, Adam
    MATURITY AND INNOVATION IN DIGITAL LIBRARIES, ICADL 2018, 2018, 11279 : 3 - 14