Language Identification from an Indian Multilingual Document Using Profile Features

Cited by: 5
Authors
Padma, M. C. [1]
Vijaya, P. A. [2]
Nagabhushan, P. [3]
Affiliations
[1] PES Coll Engn, Dept CS & Engg, Mandya 571401, Karnataka, India
[2] Malnad Coll Engn, Dept E & C Engn, Hassan, Karnataka 573201, India
[3] Univ Mysore, Dept Studies, Mysore, Karnataka, India
Keywords
Document Image Processing; Multi-lingual document; Language Identification; Top Profile; Bottom Profile; Feature extraction; Script Identification
DOI
10.1109/ICCAE.2009.35
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
To reach a larger cross section of people, a document often needs to contain text in several languages. This creates a practical difficulty for OCR, because the language of the text must be determined before a language-specific OCR engine can be applied. This research addresses the problem of recognizing the language of the text content. Because it is perhaps impossible to design a single recognizer that can identify a large number of scripts and languages, we take a middle path and work on the prioritized requirements of a particular region. In Karnataka state in India, for instance, documents in general, including official ones, contain text in three languages: English, the language of general importance; Hindi, the language of national importance; and Kannada, the language of state or regional importance. We propose to learn to identify the language of the text by thoroughly studying the nature of the top and bottom profiles of printed text lines in these three languages. The experiments used 800 text lines for learning and 600 text lines for testing. The resulting performance was 95.4%.
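As a rough illustration of the top and bottom profiles named in the abstract, the sketch below computes them for a binary text-line image using the common column-wise definition: the first ink pixel met when scanning each column from the top, and from the bottom. The abstract does not say how the authors define the profiles or which features they derive from them, so the function name `top_bottom_profiles`, the list-of-lists image representation, and the toy input are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Assumed representation: a binary text-line image, 1 = ink, 0 = background.
Image = List[List[int]]


def top_bottom_profiles(
    line: Image,
) -> Tuple[List[Optional[int]], List[Optional[int]]]:
    """Return (top, bottom) profiles of a binary text-line image.

    top[c] is the row index of the first ink pixel met when scanning
    column c downward; bottom[c] is the row index of the first ink pixel
    met when scanning column c upward. Columns without ink get None.
    """
    if not line:
        return [], []
    height, width = len(line), len(line[0])
    top: List[Optional[int]] = [None] * width
    bottom: List[Optional[int]] = [None] * width
    for c in range(width):
        ink_rows = [r for r in range(height) if line[r][c]]
        if ink_rows:
            top[c] = ink_rows[0]
            bottom[c] = ink_rows[-1]
    return top, bottom


# Hypothetical 4x5 toy line: a horizontal stroke on row 0 with stems below it.
toy = [
    [1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
top, bottom = top_bottom_profiles(toy)
print(top)
print(bottom)
```

In this toy input the top profile is flat wherever the horizontal stroke is present, while the bottom profile varies with the stems. Shape differences of this kind are what a profile-based classifier could exploit, although the specific features used in the paper are not given here.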
Pages: 332 / +
Number of pages: 2