Script line separation from Indian multi-script documents

Cited by: 16
Authors
Pal, U [1 ]
Chaudhuri, BB [1 ]
Affiliations
[1] Indian Stat Inst, Comp Vis & Pattern Recognit Unit, Kolkata 700108, India
Keywords
optical character recognition (OCR); document processing; Indian scripts and languages; multi-lingual and multi-script documents
DOI
10.1080/03772063.2003.11416318
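Chinese Library Classification (CLC)
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract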
In a multi-lingual country like India, a document page may contain more than one script form. Under the three-language formula, a document may be printed in English, Devnagari and one of the other official Indian languages. For Optical Character Recognition (OCR) of such a page, the three script forms must be separated before they are fed to the OCR system of each individual script. This paper presents an automatic technique for separating the text lines for almost all triplets of script forms. The triplets are grouped into five classes according to their characteristics, and shape-based features are used to separate them without any expensive OCR-like algorithms. The proposed approaches were tested on many documents, and the experimental results are presented. At present, the system has an overall accuracy of about 98.5%.
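The abstract does not say which shape-based features are used, so the following is only an illustrative sketch, not the authors' method. It assumes a binarized page given as rows of 0/1 pixels (1 = ink), splits it into text lines with a horizontal projection, and scores each line by its longest horizontal ink run. The idea behind the score is that Devnagari words carry a connecting headline, so such lines tend to contain long runs. The function names, the `0.5` threshold, and the toy `page` are all hypothetical.

```python
def split_lines(img):
    """Return (top, bottom) row ranges, bottom exclusive, of consecutive rows containing ink."""
    lines, top = [], None
    for y, row in enumerate(img):
        has_ink = any(row)
        if has_ink and top is None:
            top = y                      # a text line starts here
        elif not has_ink and top is not None:
            lines.append((top, y))       # a blank row closes the line
            top = None
    if top is not None:                  # line runs to the bottom edge
        lines.append((top, len(img)))
    return lines


def longest_run(row):
    """Length of the longest run of consecutive ink pixels in one row."""
    best = cur = 0
    for px in row:
        cur = cur + 1 if px else 0
        best = max(best, cur)
    return best


def headline_ratio(img, top, bottom):
    """Longest horizontal ink run in the line, as a fraction of the page width."""
    width = len(img[top])
    return max(longest_run(img[y]) for y in range(top, bottom)) / width


def label_lines(img, threshold=0.5):
    """Tag each text line as 'headline-like' or 'other' (threshold is an assumed value)."""
    return [
        ("headline-like" if headline_ratio(img, t, b) >= threshold else "other", (t, b))
        for t, b in split_lines(img)
    ]


# Toy page: the first text line has a full-width stroke, the second does not.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(label_lines(page))
```

A single feature like this could at most separate one script from the others. Distinguishing all three scripts in a triplet would need additional features, which the abstract does not enumerate.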
Pages: 3-11
Page count: 9