Evaluating and mitigating the impact of OCR errors on information retrieval

Cited by: 0
Authors
Lucas Lima de Oliveira
Danny Suarez Vargas
Antônio Marcelo Azevedo Alexandre
Fábio Corrêa Cordeiro
Diogo da Silva Magalhães Gomes
Max de Castro Rodrigues
Regis Kruel Romeu
Viviane Pereira Moreira
Affiliations
[1] Federal University of Rio Grande do Sul, Institute of Informatics
[2] Petrobras Research and Development Center (CENPES)
[3] Federal University of Rio de Janeiro, Systems Engineering and Computer Science Program (PESC/COPPE)
[4] Getulio Vargas Foundation, School of Applied Mathematics
Keywords
Information retrieval; OCR errors; Error correction; Geoscientific documents
DOI
Not available
Abstract
Optical character recognition (OCR) is typically used to extract the textual content of scanned documents. OCR output can be noisy, especially when the quality of the scanned image is poor, which in turn can affect downstream tasks such as information retrieval (IR). Post-processing OCR-ed documents is one way to fix digitization errors and, intuitively, to improve the results of downstream tasks. This work evaluates the impact of OCR digitization and correction on IR. We compared different digitization and correction methods on real OCR-ed data from an IR test collection with 22k documents and 34 query topics in the geoscientific domain, in Portuguese. Our results show significant differences in IR metrics across digitization methods (up to 5 percentage points in mean average precision). Regarding error correction, retrieval quality metrics change very little on average over the complete set of query topics. However, a more detailed analysis revealed that correction improved results for 19 of the 34 query topics. Our findings indicate that, contrary to previous work, long documents are impacted by OCR errors.
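The abstract reports results both as mean average precision (MAP) and as a count of query topics that improved after correction. As a rough illustration of how those two views relate, here is a minimal Python sketch. It is not the authors' evaluation code: the function names, topic IDs, document IDs, and rankings are all hypothetical toy values chosen for the example.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant doc IDs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-topic average precision over all topics in qrels."""
    return sum(
        average_precision(runs.get(topic, []), rel) for topic, rel in qrels.items()
    ) / len(qrels)


def improved_topics(
    baseline: Dict[str, List[str]],
    corrected: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
) -> List[str]:
    """Topics whose average precision is strictly higher in the corrected run."""
    return [
        topic
        for topic, rel in qrels.items()
        if average_precision(corrected.get(topic, []), rel)
        > average_precision(baseline.get(topic, []), rel)
    ]


# Hypothetical toy data (assumption): two topics, rankings before and after correction.
qrels = {"t1": {"d1", "d2"}, "t2": {"d3"}}
run_raw = {"t1": ["d1", "d9", "d2"], "t2": ["d8", "d3"]}
run_corrected = {"t1": ["d1", "d2", "d9"], "t2": ["d8", "d3"]}

map_raw = mean_average_precision(run_raw, qrels)
map_corrected = mean_average_precision(run_corrected, qrels)
better = improved_topics(run_raw, run_corrected, qrels)
print(f"MAP raw={map_raw:.4f} corrected={map_corrected:.4f} improved={better}")
```

Looking at per-topic average precision alongside the mean is what lets an analysis of this kind notice that individual topics improve even when the collection-wide average barely moves.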
Pages: 45-62
Number of pages: 17
Related papers
50 in total
  • [1] Evaluating and mitigating the impact of OCR errors on information retrieval
    de Oliveira, Lucas Lima
    Vargas, Danny Suarez
    Alexandre, Antonio Marcelo Azevedo
    Cordeiro, Fabio Correa
    Gomes, Diogo da Silva Magalhaes
    Rodrigues, Max de Castro
    Romeu, Regis Kruel
    Moreira, Viviane Pereira
    [J]. INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES, 2023, 24 (01) : 45 - 62
  • [2] Evaluating the Impact of OCR Errors on Topic Modeling
    Mutuvi, Stephen
    Doucet, Antoine
    Odeo, Moses
    Jatowt, Adam
    [J]. MATURITY AND INNOVATION IN DIGITAL LIBRARIES, ICADL 2018, 2018, 11279 : 3 - 14
  • [3] Evaluating text categorization in the presence of OCR errors
    Taghva, K
    Nartker, T
    Borsack, J
    Lumos, S
    Condit, A
    Young, R
    [J]. DOCUMENT RECOGNITION AND RETRIEVAL VIII, 2001, 4307 : 68 - 74
  • [4] Impact of OCR errors on the use of digital libraries: Towards a better access to information
    Chiron, Guillaume
    Doucet, Antoine
    Coustaty, Mickael
    Visani, Muriel
    Moreux, Jean-Philippe
    [J]. 2017 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL 2017), 2017, : 249 - 252
  • [5] Evaluating Supervised Topic Models in the Presence of OCR Errors
    Walker, Daniel
    Ringger, Eric
    Seppi, Kevin
    [J]. DOCUMENT RECOGNITION AND RETRIEVAL XX, 2013, 8658
  • [6] Mitigating VHF Lightning Source Retrieval Errors
    Koshak, William J.
    Mach, Douglas M.
    Bitzer, Phillip M.
    [J]. JOURNAL OF ATMOSPHERIC AND OCEANIC TECHNOLOGY, 2018, 35 (05) : 1033 - 1052
  • [7] Evaluating the impact of information technology on medication errors: A simulation
    Anderson, JG
    Jay, SJ
    Anderson, M
    Hunt, TJ
    [J]. JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2003, 10 (03) : 292 - 293
  • [8] Evaluating impact of errors in prior information on performance of microwave tomography
    Kurrant, Douglas
    Fear, Elise
    Baran, Anastasia
    LoVetri, Joe
    [J]. 2016 17TH INTERNATIONAL SYMPOSIUM ON ANTENNA TECHNOLOGY AND APPLIED ELECTROMAGNETICS (ANTEM), 2016,
  • [9] ROLE OF OCR GROWS WITH INFORMATION IMPACT
    POLIZZANO, PF
    [J]. DATA MANAGEMENT, 1983, 21 (11) : 16 - 17
  • [10] ERRORS IN OCR
    GREY, PJ
    [J]. DATA PROCESSING MAGAZINE, 1970, 12 (09) : 7 - &