Retrieval methods for English-text with misrecognized OCR characters

Cited by: 0

Authors:

Ohta, M

Takasu, A

Adachi, J

Affiliation:

Source:

PROCEEDINGS OF THE FOURTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS 1 AND 2 | 1997

Keywords:

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper presents three probabilistic text retrieval methods designed to carry out a full-text search of English documents containing OCR errors. By searching for any query term on the premise that there are errors in the recognized text, the methods presented can tolerate such errors, and therefore costly manual postediting is not required after OCR recognition. In the applied approach, confusion matrices are used to store characters which are likely to be interchanged when a particular character is misrecognized, and the respective probability of each occurrence. Moreover, a 2-gram matrix is used to store probabilities of character connection, i.e., which letter is likely to come after another. Multiple search terms are generated for an input query term by making reference to confusion matrices, after which a full-text search is run for each search term. The validity of retrieved terms is determined based on error-occurrence and character-connection probabilities. The performance of these methods is experimentally evaluated by determining retrieval effectiveness, i.e., by calculating recall and precision rates. Results indicate marked improvement in comparison with exact matching.
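The query-expansion idea described in the abstract might be sketched roughly as follows. This is a minimal illustration, not the authors' method: the confusion probabilities are invented toy values, the names `CONFUSION`, `expand_query`, and `search` are hypothetical, only one-to-one character substitutions are modeled, and the 2-gram character-connection scoring mentioned in the abstract is left out.

```python
from itertools import product

# Toy confusion matrix (assumption): CONFUSION[c][r] is the probability that
# the true character c is recognized as r. All values are invented.
CONFUSION = {
    "n": {"n": 0.90, "r": 0.06, "h": 0.04},
    "o": {"o": 0.92, "c": 0.05, "0": 0.03},
    "t": {"t": 0.95, "l": 0.05},
}


def expand_query(term, min_prob=0.01):
    """Return (variant, probability) pairs for a query term, most probable first.

    Characters missing from CONFUSION are assumed to be recognized correctly.
    """
    per_char = [list(CONFUSION.get(ch, {ch: 1.0}).items()) for ch in term]
    variants = []
    for combo in product(*per_char):
        prob = 1.0
        for _, p in combo:
            prob *= p  # treat character errors as independent (assumption)
        if prob >= min_prob:
            variants.append(("".join(r for r, _ in combo), prob))
    return sorted(variants, key=lambda v: v[1], reverse=True)


def search(text, term, min_prob=0.01):
    """Exact-match every expanded variant against whitespace-separated tokens."""
    tokens = text.split()
    hits = []
    for variant, prob in expand_query(term, min_prob):
        for i, tok in enumerate(tokens):
            if tok == variant:
                hits.append((i, tok, prob))
    return hits


if __name__ == "__main__":
    for hit in search("postediting is rot required", "not"):
        print(hit)
```

Under these assumptions, a query for "not" also retrieves the misrecognized token "rot", with a score equal to the product of the per-character confusion probabilities. The `min_prob` cutoff is one simple way to keep the number of generated search terms manageable.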

Pages: 950-956

Page count: 7