Script line separation from Indian multi-script documents

Cited by: 16
Authors
Pal, U [1]
Chaudhuri, BB [1]
Affiliations
[1] Indian Stat Inst, Comp Vis & Pattern Recognit Unit, Kolkata 700108, India
Keywords
optical character recognition (OCR); document processing; Indian scripts and languages; multi-lingual and multi-script documents
DOI
10.1080/03772063.2003.11416318
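Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract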
In a multi-lingual country like India, a document page may contain more than one script form. Under the three-language formula, the document may be printed in English, Devnagari and one of the other Indian official languages. For Optical Character Recognition (OCR) of such a document page, it is necessary to separate these three script forms before feeding them to the OCRs of individual scripts. In this paper, an automatic technique for separating the text lines is presented for almost all triplets of script forms. To do so, the triplets are grouped into five classes according to their characteristics, and shape-based features are employed to separate them without any expensive OCR-like algorithms. The proposed approaches are tested on many documents, and the experimental results are presented. At present, the system has an overall accuracy of about 98.5%.
Pages: 3-11
Number of pages: 9