Effect of OCR error correction on Arabic retrieval

Authors
Walid Magdy
Kareem Darwish
Affiliation
[1] Cairo Microsoft Innovation Center
Source
Information Retrieval | 2008 / Vol. 11
Keywords
OCR; Language modeling; Information retrieval; Error correction
Abstract
Arabic documents that are available only in print remain ubiquitous, and they can be scanned and subsequently OCR’ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling, with different correction abilities, were tested on real OCR output and on synthetic OCR degradation. Results show that the reduction in word error rate needs to pass a certain threshold to have a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-grams for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR-degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.
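The abstract's point about character 3-grams as index terms can be illustrated with a minimal sketch. It is not the authors' indexing pipeline: the function names `char_ngrams` and `ngram_overlap` are hypothetical, and the Latin-script word with a single substituted character stands in for an OCR error purely for readability. The idea shown is that one misrecognized character breaks an exact word match but leaves most of the word's 3-grams intact.

```python
def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word, in order."""
    if len(word) < n:
        # Words shorter than n are kept whole (an assumption of this sketch).
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_overlap(a, b, n=3):
    """Fraction of a's distinct n-grams that also occur in b."""
    grams_a = set(char_ngrams(a, n))
    grams_b = set(char_ngrams(b, n))
    return len(grams_a & grams_b) / len(grams_a)


clean = "retrieval"
noisy = "retrieva1"  # hypothetical OCR confusion of "l" with "1"

print(clean == noisy)               # word-level match fails
print(char_ngrams(clean))           # 7 trigrams
print(ngram_overlap(clean, noisy))  # 6 of the 7 trigrams still match
```

Under these assumptions, a query indexed by 3-grams would still share most of its terms with the degraded document word, which is consistent with the abstract's remark that short character n-grams are a reasonable strategy when only moderate error reduction is available.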
Pages: 405-425
Page count: 20