Script line separation from Indian multi-script documents

Cited by: 16

Authors:

Pal, U [1]

Chaudhuri, BB [1]

Affiliations:

[1] Indian Stat Inst, Comp Vis & Pattern Recognit Unit, Kolkata 700108, India

Source:

IETE JOURNAL OF RESEARCH | 2003 / Vol. 49 / No. 01

Keywords:

optical character recognition (OCR); document processing; Indian scripts and languages; multi-lingual and multi-script documents;

DOI:

10.1080/03772063.2003.11416318

Chinese Library Classification (CLC) codes:

TM [Electrical engineering]; TN [Electronics and communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

In a multi-lingual country like India, a document page may contain more than one script form. Under the three-language formula, the document may be printed in English, Devnagari and one of the other Indian official languages. For Optical Character Recognition (OCR) of such a document page, it is necessary to separate these three script forms before feeding them to the OCRs of the individual scripts. In this paper, an automatic technique for separating the text lines is presented for almost all triplets of script forms. To do so, the triplets are grouped into five classes according to their characteristics, and shape-based features are employed to separate them without any expensive OCR-like algorithms. The proposed approaches are tested on many documents and the experimental results are presented. At present, the system has an overall accuracy of about 98.5%.


Pages: 3-11

Page count: 9

Total: 50 entries

[1] Multi-script line identification from Indian documents
Pal, U
Sinha, S
Chaudhuri, BB
SEVENTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2003, : 880 - 884
[2] HVS inspired system for script identification in Indian multi-script documents
Pati, PB
Ramakrishnan, AG
DOCUMENT ANALYSIS SYSTEMS VII, PROCEEDINGS, 2006, 3872 : 380 - 389
[3] Script Identification of Multi-Script Documents: A Survey
Ubul, Kurban
Tursun, Gulzira
Aysa, Alimjan
Impedovo, Donato
Pirlo, Giuseppe
Yibulayin, Tuergen
IEEE ACCESS, 2017, 5 : 6546 - 6559
[4] Automatic separation of words in multi-lingual multi-script Indian documents
Pal, U
Chaudhuri, BB
PROCEEDINGS OF THE FOURTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS 1 AND 2, 1997, : 576 - 579
[5] Identification of different script lines from multi-script documents
Pal, U
Chaudhuri, BB
IMAGE AND VISION COMPUTING, 2002, 20 (13-14) : 945 - 954
[6] A blind indic script recognizer for multi-script documents
Pati, Peeta Basa
Ramakrishnan, A. G.
ICDAR 2007: NINTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2007, : 1248 - 1252
[7] Statistical comparison of classifiers for script identification from multi-script handwritten documents
Singh, Pawan Kumar
Sarkar, Ram
Das, Nibaran
Basu, Subhadip
Nasipuri, Mita
INTERNATIONAL JOURNAL OF APPLIED PATTERN RECOGNITION, 2014, 1 (02) : 152 - 172
[8] Page-level Script Identification from Multi-script Handwritten Documents
Singh, Pawan Kumar
Dalal, Santu Kumar
Sarkar, Ram
Nasipuri, Mita
2015 THIRD INTERNATIONAL CONFERENCE ON COMPUTER, COMMUNICATION, CONTROL AND INFORMATION TECHNOLOGY (C3IT), 2015,
[9] Word-Level Script Identification from Handwritten Multi-script Documents
Singh, Pawan Kumar
Mondal, Arafat
Bhowmik, Showmik
Sarkar, Ram
Nasipuri, Mita
PROCEEDINGS OF THE 3RD INTERNATIONAL CONFERENCE ON FRONTIERS OF INTELLIGENT COMPUTING: THEORY AND APPLICATIONS (FICTA) 2014, VOL 1, 2015, 327 : 551 - 558
[10] A generalized line segmentation method for multi-script handwritten text documents
Rakshit, Payel
Halder, Chayan
Md Obaidullah, Sk
Roy, Kaushik
EXPERT SYSTEMS WITH APPLICATIONS, 2023, 212
