Probabilistic retrieval of OCR degraded text using N-grams

Authors
Harding, SM [1]
Croft, WB
Weir, C
Affiliations
[1] Univ Massachusetts, CIIR, Amherst, MA 01003 USA
[2] Lockheed Martin C2 Syst, Malvern, PA 19355 USA
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper examines the retrieval of OCR-degraded text using n-gram formulations within a probabilistic retrieval system. Direct retrieval of documents using n-gram databases of 2- and 3-grams, or of 2-, 3-, 4- and 5-grams, improved retrieval performance over standard (word-based) queries on the same data when degradation reached 10 percent or worse. A second method, which uses n-grams to identify matching and near-matching terms for query expansion, also performed better than standard queries. This method was less effective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web-based retrieval application is described that uses n-gram retrieval of OCR text and displays the source document image with query term highlighting.
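As a rough illustration of the second method in the abstract (finding near-matching terms through shared n-grams), the following is a minimal sketch and not the authors' implementation. The function names, the Dice coefficient as the similarity measure, the 2- and 3-gram sizes, and the 0.5 threshold are all assumptions chosen for the example; the abstract does not specify the similarity measure or the weighting scheme.

```python
def char_ngrams(term, sizes=(2, 3)):
    """Return the set of character n-grams of `term` for each size in `sizes`."""
    term = term.lower()
    return {term[i:i + n] for n in sizes for i in range(len(term) - n + 1)}


def dice_similarity(a, b, sizes=(2, 3)):
    """Dice coefficient over the n-gram sets of two terms (assumed measure)."""
    grams_a, grams_b = char_ngrams(a, sizes), char_ngrams(b, sizes)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def expand_query_term(term, vocabulary, threshold=0.5, sizes=(2, 3)):
    """Return vocabulary terms whose n-gram overlap with `term` meets `threshold`.

    `vocabulary` stands in for the index terms of an OCR collection; the
    threshold value is a hypothetical choice, not taken from the paper.
    """
    scored = [(dice_similarity(term, v, sizes), v) for v in vocabulary]
    # Highest-scoring candidates first, keeping only those above the cutoff.
    return [v for score, v in sorted(scored, reverse=True) if score >= threshold]
```

Under these assumptions, a query term such as "retrieval" would be expanded with an OCR-corrupted variant like "retrieva1" because most of their 2- and 3-grams coincide, while an unrelated term that shares no n-grams would be excluded.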
Pages: 345-359
Page count: 15
Related papers
  • [1] Using N-grams for arabic text searching
    Mustafa, SH
    Al-Radaideh, QA
    [J]. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 2004, 55 (11): 1002-1007
  • [2] Evaluation of N-grams conflation approach in text-based information retrieval
    Kosinov, S
    [J]. EIGHTH SYMPOSIUM ON STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS, 2001: 136-142
  • [3] Experiments in spoken document retrieval using phoneme n-grams
    Ng, C
    Wilkinson, R
    Zobel, J
    [J]. SPEECH COMMUNICATION, 2000, 32 (1-2): 61-77
  • [4] Using Word N-Grams as Features in Arabic Text Classification
    Al-Thubaity, Abdulmohsen
    Alhoshan, Muneera
    Hazzaa, Itisam
    [J]. SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING, 2015, 569: 35-43
  • [5] Sentence Classification Using N-Grams in Urdu Language Text
    Awan, Malik Daler Ali
    Ali, Sikandar
    Samad, Ali
    Iqbal, Nadeem
    Missen, Malik Muhammad Saad
    Ullah, Niamat
    [J]. SCIENTIFIC PROGRAMMING, 2021, 2021
  • [6] CONTINUOUS MODELS OF AFFECT FROM TEXT USING N-GRAMS
    Malandrakis, Nikolaos
    Potamianos, Alexandros
    Narayanan, Shrikanth
    [J]. 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013: 8500-8504
  • [7] Robust polyphonic music retrieval with N-grams
    Doraisamy, S
    Rüeger, S
    [J]. JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2003, 21 (01): 53-70
  • [8] Robust Polyphonic Music Retrieval with N-grams
    Shyamala Doraisamy
    Stefan Rüger
    [J]. Journal of Intelligent Information Systems, 2003, 21: 53-70
  • [9] Part of speech n-grams and Information Retrieval
    Lioma, Christina
    van Rijsbergen, C. J. Keith
    [J]. REVUE FRANCAISE DE LINGUISTIQUE APPLIQUEE, 2008, 13 (01): 9-22
  • [10] ROBUST MODELING OF MUSICAL CHORD SEQUENCES USING PROBABILISTIC N-GRAMS
    Scholz, Ricardo
    Vincent, Emmanuel
    Bimbot, Frederic
    [J]. 2009 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-8, PROCEEDINGS, 2009: 53-56