Evaluating and mitigating the impact of OCR errors on information retrieval

Authors
Lucas Lima de Oliveira
Danny Suarez Vargas
Antônio Marcelo Azevedo Alexandre
Fábio Corrêa Cordeiro
Diogo da Silva Magalhães Gomes
Max de Castro Rodrigues
Regis Kruel Romeu
Viviane Pereira Moreira
Affiliations
[1] Federal University of Rio Grande do Sul, Institute of Informatics
[2] Petrobras Research and Development Center (CENPES), Systems Engineering and Computer Science Program (PESC/COPPE)
[3] Federal University of Rio de Janeiro, School of Applied Mathematics
[4] Getulio Vargas Foundation
Keywords
Information retrieval; OCR errors; Error correction; Geoscientific documents
Abstract
Optical character recognition (OCR) is typically used to extract the textual contents of scanned texts. The output of OCR can be noisy, especially when the quality of the scanned image is poor, which in turn can impact downstream tasks such as information retrieval (IR). Post-processing OCR-ed documents is one way to fix digitization errors and, intuitively, improve the results of downstream tasks. This work evaluates the impact of OCR digitization and correction on IR. We compared different digitization and correction methods on real OCR-ed data from an IR test collection with 22k documents and 34 query topics in the geoscientific domain in Portuguese. Our results show significant differences in IR metrics across the digitization methods (up to 5 percentage points in mean average precision). Regarding the impact of error correction, retrieval quality metrics change very little when averaged over the complete set of query topics. However, a more detailed analysis revealed that correction improved 19 out of 34 query topics. Our findings indicate that, contrary to previous work, long documents are impacted by OCR errors.
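The abstract reports differences between digitization methods in percentage points of mean average precision (MAP). As a reminder of how that metric is computed, here is a minimal sketch in Python. The function names, document ids, and toy rankings are hypothetical and serve only as illustration; they are not the authors' evaluation code or data.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision for one query: the sum of precision@k at every
    rank k holding a relevant document, divided by the number of
    relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """MAP: the mean of per-query average precision over all judged queries."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two queries, ranked over two versions of a collection.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
clean_run = {"q1": ["d1", "d3", "d5"], "q2": ["d2", "d4"]}
noisy_run = {"q1": ["d1", "d5", "d3"], "q2": ["d4", "d2"]}

map_clean = mean_average_precision(clean_run, qrels)
map_noisy = mean_average_precision(noisy_run, qrels)
# Difference expressed in percentage points, the unit used in the abstract.
diff_pp = 100 * (map_clean - map_noisy)
```

Because MAP averages over queries, a small overall difference can hide larger per-query gains and losses, which is why a per-topic breakdown such as the one mentioned in the abstract is informative.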
Pages: 45-62
Number of pages: 17