Script line separation from Indian multi-script documents

Cited by: 16

Authors:

Pal, U [1]

Chaudhuri, BB [1]

Affiliation:

[1] Indian Stat Inst, Comp Vis & Pattern Recognit Unit, Kolkata 700108, India

Source:

IETE JOURNAL OF RESEARCH | 2003 / Vol. 49 / No. 01

Keywords:

optical character recognition (OCR); document processing; Indian scripts and languages; multi-lingual and multi-script documents

DOI:

10.1080/03772063.2003.11416318

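Chinese Library Classification (CLC):

TM [Electrical Engineering]; TN [Electronics and Communication Technology]

Discipline classification codes:

0808; 0809

Abstract: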

In a multi-lingual country like India, a document page may contain more than one script form. Under the three-language formula, a document may be printed in English, Devnagari and one of the other Indian official languages. For Optical Character Recognition (OCR) of such a document page, it is necessary to separate these three script forms before feeding them to the OCR systems of the individual scripts. In this paper, an automatic technique for separating the text lines is presented for almost all triplets of script forms. To do so, the triplets are grouped into five classes according to their characteristics, and shape-based features are employed to separate them without any expensive OCR-like algorithms. The proposed approaches are tested on many documents and the experimental results are presented. At present, the system has an overall accuracy of about 98.5%.

Pages: 3-11

Page count: 9
