Retrieval methods for English-text with misrecognized OCR characters

Cited by: 0
Authors
Ohta, M
Takasu, A
Adachi, J
Institutions
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents three probabilistic text retrieval methods designed to carry out a full-text search of English documents containing OCR errors. By searching for any query term on the premise that there are errors in the recognized text, the methods presented can tolerate such errors, and therefore costly manual postediting is not required after OCR recognition. In the applied approach, confusion matrices are used to store characters which are likely to be interchanged when a particular character is misrecognized, and the respective probability of each occurrence. Moreover, a 2-gram matrix is used to store probabilities of character connection, i.e., which letter is likely to come after another. Multiple search terms are generated for an input query term by making reference to confusion matrices, after which a full-text search is run for each search term. The validity of retrieved terms is determined based on error-occurrence and character-connection probabilities. The performance of these methods is experimentally evaluated by determining retrieval effectiveness, i.e., by calculating recall and precision rates. Results indicate marked improvement in comparison with exact matching.
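The query-expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `CONFUSION` table, its probabilities, the `min_prob` cutoff, and the function names `generate_search_terms` and `search` are all hypothetical, and the 2-gram (character-connection) weighting is left out because the abstract does not say how it is combined with the error probabilities.

```python
from itertools import product

# Hypothetical confusion table (illustrative values, not from the paper):
# intended character -> {string the OCR engine may emit: probability}.
CONFUSION = {
    "m": {"m": 0.90, "rn": 0.10},
    "l": {"l": 0.85, "1": 0.15},
}


def generate_search_terms(query, confusion, min_prob=0.01):
    """Map each OCR variant of `query` to its error-occurrence probability."""
    # Characters missing from the table are assumed to be recognized correctly.
    options = [list(confusion.get(ch, {ch: 1.0}).items()) for ch in query]
    variants = {}
    for combo in product(*options):
        prob = 1.0
        for _, p in combo:
            prob *= p
        if prob >= min_prob:  # drop very unlikely variants (assumed cutoff)
            term = "".join(s for s, _ in combo)
            variants[term] = variants.get(term, 0.0) + prob
    return variants


def search(text, query, confusion, min_prob=0.01):
    """Return sorted (position, matched variant, probability) tuples."""
    hits = []
    for term, prob in generate_search_terms(query, confusion, min_prob).items():
        start = text.find(term)
        while start != -1:
            hits.append((start, term, prob))
            start = text.find(term, start + 1)
    return sorted(hits)
```

With this toy table, `search("Send rnail or mai1 today", "mail", CONFUSION)` finds the degraded forms `rnail` and `mai1`, which an exact match on `mail` would miss. The number of variants grows multiplicatively with query length, which is why the sketch prunes low-probability variants with a cutoff.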
Pages: 950-956
Number of pages: 7
Related papers
50 in total
  • [21] METHODS FOR TEXT RETRIEVAL
    FELICIAN, L
    ELETTROTECNICA, 1990, 77 (11): 1037-1045
  • [22] 'WHAT A LIFE' + A SHORT ENGLISH-TEXT FOR INTENSIVE READING, INTRODUCED BY HEDBERG,JOHANNES
    Unknown
    MODERNA SPRAK, 1986, 80 (01): 69-71
  • [23] GOSPEL ACCORDING TO MARK - ENGLISH-TEXT WITH INTRODUCTION, EXPOSITION AND NOTES - LANE,WL
    HOERBER, RG
    LUTHERAN QUARTERLY, 1976, 28 (04): 390-390
  • [24] BERKELEY,G. 'OF INFINITIES' + ENGLISH-TEXT WITH FRENCH TRANSLATION AND NOTES BY BERLIOZLETELLIER,DOMINIQUE
    BERKELEY, G
    REVUE PHILOSOPHIQUE DE LA FRANCE ET DE L ETRANGER, 1982, 107 (01): 45-57
  • [25] THE GOSPEL ACCORDING TO MARK - THE ENGLISH-TEXT WITH INTRODUCTION, EXPOSITION AND NOTES, - LANE,WL
    KEE, HC
    JOURNAL OF BIBLICAL LITERATURE, 1975, 94 (03): 460-461
  • [26] Probabilistic retrieval of OCR degraded text using N-grams
    Harding, SM
    Croft, WB
    Weir, C
    RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES, 1997, 1324: 345-359
  • [27] 'AUS DEM ERSTEN GESANG' + TRANSLATION BY KEMP,FRIEDHELM PLUS ORIGINAL FRENCH TEXT AND SYLVESTER ENGLISH-TEXT
    DUBARTAS, G
    AKZENTE-ZEITSCHRIFT FUR LITERATUR, 1984, 31 (06): 501-503
  • [28] DICKINSON,EMILY + POETRY, TRANSLATED BY KEMP,FRIEDHELM, PLUS ORIGINAL ENGLISH-TEXT - INTRODUCTION
    Unknown
    AKZENTE-ZEITSCHRIFT FUR LITERATUR, 1984, 31 (06): 518-518
  • [29] Scanned english document retrieval based on OCR and word shape coding
    Xia, Yong
    Dai, Ru-Wei
    Xiao, Bai-Hua
    Wang, Chun-Heng
    Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2009, 22 (03): 488-493
  • [30] A document retrieval method from handwritten characters based on OCR and character shape information
    Kameshiro, T
    Hirano, T
    Okada, Y
    Yoda, F
    SIXTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, PROCEEDINGS, 2001: 597-601