Optical character recognition errors and their effects on natural language processing

Author
Daniel Lopresti
Affiliation
[1] Lehigh University, Department of Computer Science and Engineering
Keywords
Performance evaluation; Optical character recognition; Sentence boundary detection; Tokenization; Part-of-speech tagging
DOI
Not available
Abstract
Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. This dataset has also been made available online to encourage future investigations.
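To make the idea of classifying recognition errors by dynamic programming concrete, the following is a minimal sketch under stated assumptions. It aligns a ground-truth string with OCR output using ordinary character-level edit distance and labels each aligned position as a match, substitution, insertion, or deletion. It is not the hierarchical method of the paper, and the function name `align` and the sample strings are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

Op = Tuple[str, str, str]  # (kind, truth_char, ocr_char)


def align(truth: str, ocr: str) -> Tuple[int, List[Op]]:
    """Align ground truth with OCR output using unit-cost edit distance.

    Illustrative only: a flat, character-level alignment, not the
    hierarchical formulation described in the abstract.
    """
    n, m = len(truth), len(ocr)
    # cost[i][j] = minimum edits needed to turn truth[:i] into ocr[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])):
            kind = "match" if truth[i - 1] == ocr[j - 1] else "substitution"
            ops.append((kind, truth[i - 1], ocr[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("deletion", truth[i - 1], ""))  # character lost by OCR
            i -= 1
        else:
            ops.append(("insertion", "", ocr[j - 1]))  # spurious OCR character
            j -= 1
    ops.reverse()
    return cost[n][m], ops


distance, ops = align("the cat.", "tbe cat")
errors = [op for op in ops if op[0] != "match"]
print(distance, errors)
```

In this made-up sample, the dropped period is the kind of error that could plausibly mislead a sentence boundary detector, while the substituted letter changes a token that a part-of-speech tagger would see.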
Pages: 141 - 151
Number of pages: 10
Related papers
  • [1] Optical character recognition errors and their effects on natural language processing
    Lopresti, Daniel
    [J]. INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, 2009, 12 (03) : 141 - 151
  • [2] Facilitating clinical research through automation: Combining optical character recognition with natural language processing
    Hom, Julie
    Nikowitz, Janet
    Ottesen, Rebecca
    Niland, Joyce C.
    [J]. CLINICAL TRIALS, 2022, 19 (05) : 504 - 511
  • [3] Authorship Attribution and Optical Character Recognition Errors
    Juola, Patrick
    Noecker, John I., Jr.
    Ryan, Michael V.
    [J]. TRAITEMENT AUTOMATIQUE DES LANGUES, 2012, 53 (03) : 101 - 127
  • [4] Language model for Chinese character recognition with dense errors
    Zhang, S
    Wu, XL
    [J]. IC-AI'2001: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS I-III, 2001 : 598 - 602
  • [5] OPTICAL CHARACTER RECOGNITION CUTS DATA COLLECTION ERRORS
    FAHRLANDER, H
    BELL, B
    [J]. INDUSTRIAL ENGINEER, 1971, 3 (02) : 26 - +
  • [6] Validation of a Hybrid Natural Language Processing Tool Utilizing Optical Character Recognition for Data Extraction From Scanned Colonoscopy Reports
    Hayat, Umar
    Isseh, Mahmoud
    Isseh, Nazih
    Ibrahim, Mounir
    McMichael, John
    Lopez, Rocio
    Bhatt, Amit
    Rhodes, Colin
    Burke, Carol A.
    Rizk, Maged
    [J]. GASTROINTESTINAL ENDOSCOPY, 2017, 85 (05) : AB417 - AB418
  • [7] Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports
    Laique, Sobia Nasir
    Hayat, Umar
    Sarvepalli, Shashank
    Vaughn, Byron
    Ibrahim, Mounir
    McMichael, John
    Qaiser, Kanza Noor
    Burke, Carol
    Bhatt, Amit
    Rhodes, Colin
    Rizk, Maged K.
    [J]. GASTROINTESTINAL ENDOSCOPY, 2021, 93 (03) : 750 - 757
  • [8] Virtual assistant for first responders using natural language understanding and optical character recognition
    Do, Vickie
    Huyen, Alexander
    Joubert, Jacques
    Gabriel, Mina
    Yun, Kyongsik
    Lu, Thomas
    Chow, Edward
    [J]. PATTERN RECOGNITION AND TRACKING XXXIII, 2022, 12101
  • [9] On Natural Language Processing and Plan Recognition
    Geib, Christopher W.
    Steedman, Mark
    [J]. 20TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2007 : 1612 - 1617
  • [10] Character strings to natural language processing in information retrieval
    Mohd, T
    Sembok, T
    [J]. DIGITAL LIBRARIES: TECHNOLOGY AND MANAGEMENT OF INDIGENOUS KNOWLEDGE FOR GLOBAL ACCESS, 2003, 2911 : 26 - 33