CMATERdb1: a database of unconstrained handwritten Bangla and Bangla–English mixed script document image

Authors
Ram Sarkar
Nibaran Das
Subhadip Basu
Mahantapas Kundu
Mita Nasipuri
Dipak Kumar Basu
Affiliations
[1] Jadavpur University, Computer Science and Engineering Department
[2] Jadavpur University, A.I.C.T.E. Emeritus Fellow, Computer Science and Engineering Department
Keywords
Unconstrained handwritten document image database; Text line extraction; Ground truth preparation; OCR of multi-script document
DOI
Not available
Abstract
This paper describes the preparation of a benchmark database for research on off-line Optical Character Recognition (OCR) of document images of handwritten Bangla text and Bangla text mixed with English words. It is the first handwritten database in this area to be made available as an open-source resource. Because India is a multilingual country with a colonial past, multi-script document pages are very common. The database contains 150 handwritten document pages, of which 100 are written purely in Bangla script and the remaining 50 are written in Bangla text mixed with English words. The pages were collected from different data sources. After collection, all documents were preprocessed and divided into two groups: CMATERdb1.1.1, containing document pages written in Bangla script only, and CMATERdb1.2.1, containing document pages written in Bangla text mixed with English words. We also provide ground truth images for line segmentation. To generate them, we first labeled each line in a document page automatically using one of our previously developed line extraction techniques [Khandelwal et al., PReMI 2009, pp. 369–374] and then corrected any errors with our tool GT Gen 1.1. Our algorithm achieves line extraction accuracies of 90.6% and 92.38% on the two databases, respectively. Both databases, along with the ground truth annotations and the ground truth generation tool, are freely available at http://code.google.com/p/cmaterdb.
Pages: 71-83
Page count: 12
Source
International Journal on Document Analysis and Recognition, 2012, 15(1): 71-83