Automatic separation of words in multi-lingual multi-script Indian documents

Authors
Pal, U
Chaudhuri, BB
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In a multilingual country like India, a single document may contain more than one script. Before such a document can be passed to script-specific OCR systems, the scripts must be separated. This paper describes an automatic word segmentation approach that separates the Roman, Bangla and Devnagari scripts present in a single document. The approach has a tree structure: first, Roman-script words are separated using the 'headline' feature, which is common in Bangla and Devnagari but absent in Roman. Next, Bangla and Devnagari words are separated using finer characteristics of their character sets, without recognizing individual characters. At present, the system has an overall accuracy of 96.09%.
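The first branch of the tree (Roman versus headline-bearing scripts) can be illustrated with a small sketch. This is not the authors' implementation; the abstract does not say how the headline is detected. The sketch assumes a binarized word image given as rows of 0/1 pixels, and it assumes that a headline shows up as a long unbroken horizontal run of ink in the upper half of the word. The function names (`longest_run`, `has_headline`, `first_stage_label`) and the 0.7 width threshold are hypothetical choices made for illustration only.

```python
from typing import Sequence

Image = Sequence[Sequence[int]]  # binary word image: 1 = ink, 0 = background


def longest_run(row: Sequence[int]) -> int:
    """Length of the longest unbroken run of ink pixels in one row."""
    best = current = 0
    for pixel in row:
        current = current + 1 if pixel else 0
        best = max(best, current)
    return best


def has_headline(word: Image, min_fraction: float = 0.7) -> bool:
    """Guess whether a word image carries a headline.

    Assumption (not from the paper): a headline is a row in the upper half
    of the word whose longest ink run covers at least `min_fraction` of
    the word width.
    """
    if not word or not word[0]:
        return False
    width = len(word[0])
    upper_half = word[: max(1, len(word) // 2)]
    return any(longest_run(row) >= min_fraction * width for row in upper_half)


def first_stage_label(word: Image) -> str:
    """First branch of the tree: Roman versus headline-bearing scripts."""
    return "Bangla/Devnagari" if has_headline(word) else "Roman"


# Toy 4x6 word images, invented for illustration.
headline_word = [
    [1, 1, 1, 1, 1, 1],  # continuous top stroke spanning the word
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
]
roman_word = [
    [1, 0, 0, 1, 0, 0],  # separate vertical strokes, no joining line
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
]

print(first_stage_label(headline_word))
print(first_stage_label(roman_word))
```

Words labelled as headline-bearing would then go to the second branch, which separates Bangla from Devnagari using finer character-set features; the abstract does not name those features, so they are not sketched here.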
Pages: 576-579
Page count: 4