Effect of OCR error correction on Arabic retrieval

Authors
Walid Magdy
Kareem Darwish
Affiliation
[1] Cairo Microsoft Innovation Center
Source
Information Retrieval | 2008 / Vol. 11
Keywords
OCR; Language modeling; Information retrieval; Error correction
Abstract
Arabic documents that are available only in print continue to be ubiquitous, and they can be scanned and subsequently OCR’ed to ease their retrieval. This paper explores the effect of context-based OCR correction on the effectiveness of retrieving Arabic OCR documents using different index terms. Different OCR correction techniques based on language modeling, with different correction abilities, were tested on real OCR output and on synthetic OCR degradation. Results show that the reduction in word error rate needs to pass a certain limit to have a noticeable effect on retrieval. If only moderate error reduction is available, then using short character n-grams for retrieval without error correction is not a bad strategy. Word-based correction in conjunction with language modeling had a statistically significant impact on retrieval even for character 3-grams, which are known to be among the best index terms for OCR-degraded Arabic text. Further, using a sufficiently large language model for correction can minimize the need for morphologically sensitive error correction.
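The two quantities the abstract leans on, character n-gram index terms and word error rate, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `char_ngrams` and `word_error_rate` are hypothetical, tokenization is plain whitespace splitting, and word error rate is assumed to be word-level edit distance divided by the number of reference words.

```python
def char_ngrams(text, n=3):
    """Return overlapping character n-grams for each whitespace-delimited word.

    Words shorter than n are kept whole (an assumption made for this sketch).
    """
    grams = []
    for word in text.split():
        if len(word) < n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0
```

Under these assumptions, a single misrecognized character leaves most of a word's 3-grams intact: `char_ngrams("retrieval")` and `char_ngrams("retrieva1")` share six of their seven 3-grams, whereas the two whole words do not match at all. That is one way to read the abstract's remark that short character n-grams are a reasonable index term when only moderate error reduction is available.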
Pages: 405-425
Page count: 20