Script line separation from Indian multi-script documents

Cited by: 16
Authors
Pal, U [1]
Chaudhuri, BB [1]
Affiliations
[1] Indian Stat Inst, Comp Vis & Pattern Recognit Unit, Kolkata 700108, India
Keywords
optical character recognition (OCR); document processing; Indian scripts and languages; multi-lingual and multi-script documents
DOI
10.1080/03772063.2003.11416318
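Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Subject classification codes
0808; 0809
Abstract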
In a multi-lingual country like India, a document page may contain more than one script form. Under the three-language formula, a document may be printed in English, Devnagari and one of the other Indian official languages. For Optical Character Recognition (OCR) of such a document page, these three script forms must be separated before they are fed to the OCR systems of the individual scripts. This paper presents an automatic technique for separating the text lines for almost all triplets of script forms. To do so, the triplets are grouped into five classes according to their characteristics, and shape-based features are employed to separate them without any expensive OCR-like algorithms. The proposed approaches are tested on many documents and the experimental results are presented. At present, the system has an overall accuracy of about 98.5%.
Pages: 3-11
Number of pages: 9
Related papers
50 in total
  • [21] Multi-script handwriting recognition with FOHDEL
    Malaviya, A
    Leja, C
    Peters, L
    1996 BIENNIAL CONFERENCE OF THE NORTH AMERICAN FUZZY INFORMATION PROCESSING SOCIETY - NAFIPS, 1996, : 147 - 151
  • [22] Multi-script versus single-script scenarios in automatic off-line signature verification
    Das, Abhijit
    Ferrer, Miguel A.
    Pal, Umapada
    Pal, Srikanta
    Diaz, Moises
    Blumenstein, Michael
    IET BIOMETRICS, 2016, 5 (04) : 305 - 313
  • [23] Handwritten Indic Script Identification in Multi-Script Document Images: A Survey
    Obaidullah, Sk Md
    Santosh, K. C.
    Das, Nibaran
    Halder, Chayan
    Roy, Kaushik
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2018, 32 (10)
  • [24] Multi-script Text Detection from Images: A Survey
    Dadiya, Nidhi J.
    Goswami, Mukesh M.
    2019 INNOVATIONS IN POWER AND ADVANCED COMPUTING TECHNOLOGIES (I-PACT), 2019,
  • [25] A multilingual multi-script database of Indian theses: Implementation of unicode at Vidyanidhi
    Urs, SR
    Harinarayana, NS
    Kumbar, M
    DIGITAL LIBRARIES: PEOPLE, KNOWLEDGE, AND TECHNOLOGY, PROCEEDINGS, 2002, 2555 : 305 - 314
  • [26] Multi-script Text Extraction from Natural Scenes
    Gomez, Lluis
    Karatzas, Dimosthenis
    2013 12TH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), 2013, : 467 - 471
  • [27] Word level multi-script identification
    Pati, Peeta Basa
    Ramakrishnan, A. G.
    PATTERN RECOGNITION LETTERS, 2008, 29 (09) : 1218 - 1229
  • [28] Multi-skew detection of Indian script documents
    Pal, U
    Mitra, M
    Chaudhuri, BB
    SIXTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, PROCEEDINGS, 2001, : 292 - 296
  • [29] A novel framework for automatic sorting of postal documents with multi-script address blocks
    Basu, Subhadip
    Das, Nibaran
    Sarkar, Ram
    Kundu, Mahantapas
    Nasipuri, Mita
    Basu, Dipak Kumar
    PATTERN RECOGNITION, 2010, 43 (10) : 3507 - 3521
  • [30] Multi-script Writer Identification using Dissimilarity
    Bertolini, Diego
    Oliveira, Luiz S.
    Sabourin, Robert
    2016 23RD INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2016, : 3025 - 3030