Automatic separation of words in multi-lingual multi-script Indian documents

Cited by: 0
Authors
Pal, U
Chaudhuri, BB
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In a multi-lingual country like India, a single document may contain more than one script. Such a document must have its scripts separated before the text is fed to the OCR system for each individual script. This paper describes an automatic word segmentation approach that separates the Roman, Bangla and Devnagari scripts present in a single document. The approach has a tree structure. First, Roman script words are separated using the 'headline' feature: the headline is common in Bangla and Devnagari but absent in Roman. Next, Bangla and Devnagari words are separated using finer characteristics of their character sets, without recognizing individual characters. At present, the system has an overall accuracy of 96.09%.
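The first level of the tree described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the headline is detected, so the sketch assumes a word is given as a binary bitmap and treats a long unbroken horizontal ink run in the upper half of the word as a headline. The names `longest_run`, `has_headline`, `classify_word` and the threshold `min_ratio` are hypothetical, and the Bangla/Devnagari split is left as a placeholder because the abstract does not name the finer features it uses.

```python
from typing import List

# Assumed input format: 1 = ink, 0 = background, one inner list per pixel row.
Bitmap = List[List[int]]


def longest_run(row: List[int]) -> int:
    """Length of the longest unbroken run of ink pixels in one row."""
    best = current = 0
    for pixel in row:
        current = current + 1 if pixel else 0
        best = max(best, current)
    return best


def has_headline(word: Bitmap, min_ratio: float = 0.7) -> bool:
    """Hypothetical headline test: True if some row in the upper half of the
    word holds an ink run covering at least min_ratio of the word width."""
    if not word or not word[0]:
        return False
    width = len(word[0])
    upper_half = word[: max(1, len(word) // 2)]
    return any(longest_run(row) >= min_ratio * width for row in upper_half)


def classify_word(word: Bitmap) -> str:
    """First tree level only: words without a headline are taken as Roman.
    The second level (Bangla vs. Devnagari) is not modelled here."""
    if not has_headline(word):
        return "Roman"
    return "Bangla-or-Devnagari"


# Toy bitmaps, invented for illustration (not data from the paper).
roman_like = [
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 0, 0, 1],
]
headline_like = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 1, 0],
]

print(classify_word(roman_like))
print(classify_word(headline_like))
```

The threshold of 0.7 is an arbitrary illustrative value. A real system would need to tune it and to handle skew, noise and broken headlines, none of which are covered here.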
Pages: 576-579
Number of pages: 4