Impact of OCR errors on the use of digital libraries: Towards a better access to information

Authors
Chiron, Guillaume [1]
Doucet, Antoine [2]
Coustaty, Mickael [2]
Visani, Muriel [2]
Moreux, Jean-Philippe [1]
Affiliations
[1] Natl Lib France, F-75706 Paris, France
[2] Univ La Rochelle, L3i Lab, Ave Michel Crepeau, F-17042 La Rochelle 1, France
Keywords
Digital libraries; OCR errors; indexation bias; search logs;
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Digital collections are increasingly used for a variety of purposes. In Europe alone, we can conservatively estimate that tens of thousands of users consult digital libraries daily. These uses are often motivated by qualitative and quantitative research. However, caution is advised, as most digitized documents are indexed through their OCRed version, which is far from perfect, especially for ancient documents. In this paper, we aim to estimate the impact of OCR errors on the use of a major online platform: the Gallica digital library of the National Library of France. It holds more than 100M OCRed documents and receives 80M search queries every year. In this context, we introduce two main contributions. First, we present and provide an original corpus of OCRed documents comprising 12M characters, along with the corresponding gold standard, with an equal share of English- and French-written documents. Second, we compute statistics on OCR errors using a novel alignment method introduced in this paper. Using all the user queries submitted to the Gallica portal over 4 months, we take advantage of our error model to propose an indicator that predicts the relative risk that queried terms mismatch targeted resources due to OCR errors, underlining the critical extent to which OCR quality affects access to digital libraries.
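To make the idea of collecting OCR error statistics from aligned text concrete, here is a minimal sketch. It is not the alignment method introduced in the paper, which the abstract does not describe; as a stand-in assumption it uses the generic matcher from Python's standard `difflib` module. The function names `ocr_error_counts` and `mismatch_rate` and the sample strings are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def ocr_error_counts(gold: str, ocr: str) -> Counter:
    """Count (gold span, OCR span) pairs where the two texts disagree.

    Assumption: difflib's generic alignment stands in for the paper's
    own alignment method, which is not specified in the abstract.
    """
    counts = Counter()
    matcher = SequenceMatcher(None, gold, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record the gold substring and what the OCR produced instead.
            counts[(gold[i1:i2], ocr[j1:j2])] += 1
    return counts


def mismatch_rate(gold: str, ocr: str) -> float:
    """Share of gold characters not matched by the alignment."""
    if not gold:
        return 0.0
    matcher = SequenceMatcher(None, gold, ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - matched / len(gold)


if __name__ == "__main__":
    gold_text = "better access"   # hypothetical gold-standard string
    ocr_text = "be, er access"    # hypothetical OCR output
    print(ocr_error_counts(gold_text, ocr_text))
    print(mismatch_rate(gold_text, ocr_text))
```

Aggregating such counts over a whole corpus would give a rough confusion table of frequent OCR substitutions, which is the kind of input an error model for query terms could build on.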
Pages: 249-252
Page count: 4