Probabilistic retrieval of OCR degraded text using N-grams

Authors
Harding, SM [1]
Croft, WB
Weir, C
Affiliations
[1] Univ Massachusetts, CIIR, Amherst, MA 01003 USA
[2] Lockheed Martin C2 Syst, Malvern, PA 19355 USA
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This paper examines the retrieval of OCR-degraded text using n-gram formulations within a probabilistic retrieval system. Direct retrieval of documents using n-gram databases of 2- and 3-grams, or of 2-, 3-, 4- and 5-grams, improved retrieval performance over standard (word-based) queries on the same data when degradation reached 10 percent or worse. A second method, which uses n-grams to identify matching and near-matching terms for query expansion, also performed better than standard queries. This method was less effective than direct n-gram query formulations but can likely be improved with alternative query component weighting schemes and measures of term similarity. Finally, a web-based retrieval application is described that uses n-gram retrieval of OCR text and displays the source document image with query terms highlighted.
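The abstract does not say how n-grams are extracted or how near-matching terms are scored, so the following is only an illustrative sketch of the general idea behind n-gram based query expansion. It assumes character 2- and 3-grams with space padding and a Dice coefficient as the similarity measure; the function names, the padding scheme, and the 0.5 threshold are all hypothetical choices, not taken from the paper.

```python
from collections import Counter


def char_ngrams(term, sizes=(2, 3)):
    """Return a multiset of character n-grams of the given sizes (assumed: 2 and 3)."""
    padded = f" {term.lower()} "  # pad so that word boundaries also form n-grams
    grams = Counter()
    for n in sizes:
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def dice_similarity(a, b, sizes=(2, 3)):
    """Dice coefficient over n-gram multisets (an assumed measure, not the paper's)."""
    ga, gb = char_ngrams(a, sizes), char_ngrams(b, sizes)
    shared = sum((ga & gb).values())  # multiset intersection
    total = sum(ga.values()) + sum(gb.values())
    return 2 * shared / total if total else 0.0


def expand_term(query_term, vocabulary, threshold=0.5):
    """Return vocabulary terms whose n-gram similarity to the query term meets the threshold."""
    scored = [(dice_similarity(query_term, v), v) for v in vocabulary]
    return [v for s, v in sorted(scored, reverse=True) if s >= threshold]


# Hypothetical OCR-degraded vocabulary: "retrieva1" has a digit 1 in place of the letter l.
print(expand_term("retrieval", ["retrieval", "retrieva1", "banana"]))
```

In this sketch a single OCR character error changes only the few n-grams that overlap it, so the degraded form still shares most of its n-grams with the clean query term and is picked up as an expansion candidate, whereas an exact word match would miss it.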
Pages: 345-359
Page count: 15