A generative probabilistic OCR model for NLP applications

Authors
Kolak, O [1]
Byrne, W [1]
Resnik, P [1]
Affiliations
[1] University of Maryland, College Park, MD 20742, USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in order to make it more useful for NLP tasks. We present an implementation of the model based on finite-state models, demonstrate the model's ability to significantly reduce character and word error rate, and provide evaluation results involving automatic extraction of translation lexicons from printed text.
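The noisy channel idea in the abstract can be illustrated with a minimal sketch: choose the true word W that maximizes P(W) * P(O | W) for an observed OCR token O. This is not the authors' finite-state implementation. The toy unigram table `WORD_PROBS`, the per-edit error probability `p_err`, and the use of plain edit distance as a stand-in for the channel model are all assumptions made for illustration. The sketch also shows one common way to compute a character error rate.

```python
import math

# Hypothetical toy unigram language model P(W); values are illustrative only.
WORD_PROBS = {"model": 0.5, "modal": 0.2, "motel": 0.3}


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def channel_log_prob(observed, true, p_err=0.05):
    """Crude stand-in for log P(O | W): each edit costs log(p_err),
    each remaining character of the true word costs log(1 - p_err)."""
    d = edit_distance(observed, true)
    kept = max(len(true) - d, 0)
    return d * math.log(p_err) + kept * math.log(1 - p_err)


def correct(observed):
    """Return the lexicon word maximizing log P(W) + log P(O | W)."""
    return max(
        WORD_PROBS,
        key=lambda w: math.log(WORD_PROBS[w]) + channel_log_prob(observed, w),
    )


def char_error_rate(hypothesis, reference):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(hypothesis, reference) / len(reference)
```

For example, `correct("mode1")` selects the lexicon entry with the best combined score, and `char_error_rate("mode1", "model")` measures how far the raw OCR token is from the intended word. A real system would replace the edit-distance channel with learned character confusion probabilities and the unigram table with a proper language model.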
Pages: 134 - 141
Number of pages: 8