Hierarchical content classification and script determination for automatic document image processing

Cited by: 9
Authors
Chi, Z [1]
Wang, Q
Siu, WC
Affiliations
[1] Hong Kong Polytech Univ, Ctr Multimedia Signal Proc, Dept Elect & Informat Engn, Hong Kong, Hong Kong, Peoples R China
[2] Northwestern Polytech Univ, Dept Comp Sci & Engn, Xian 710072, Peoples R China
Keywords
document image processing; page segmentation; content classification; script determination; background thinning; cross-correlation; Kolmogorov complexity; neural networks
DOI
10.1016/S0031-3203(03)00128-6
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Page segmentation and image content classification play an important role in automatic image processing with applications to mixed-type document image compression, form and check reading, and automatic mail sorting. In this paper, we first present an enhanced background thinning based approach for fast page segmentation. After the analysis of three different methods individually, a hierarchical approach for document content classification is proposed, which classifies a sub-image into one of two categories: text and halftone. Our approach combines a neural network model, cross-correlation metric, and Kolmogorov complexity measure in a hierarchical structure. Considering the necessity of a recognition system, we also propose using a three-layer feedforward neural network to classify text regions into Chinese and English scripts. The classification accuracy on a number of document images reaches 100% and 97.1% for halftone region and text region, respectively. Meanwhile, the system can achieve a correct rate of 92.3% and 95.0% for Chinese and alphabetic script determination, respectively. © 2003 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
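As a rough illustration of how a cross-correlation metric and a complexity measure might be chained in a hierarchy to separate text from halftone blocks, here is a minimal Python sketch. It is not the authors' method: the abstract gives no algorithmic detail, so the function names, both thresholds, the use of a zlib compression ratio as a stand-in for Kolmogorov complexity (which is not computable), and the direction of each test are all assumptions made for this toy example. The neural network stage is omitted.

```python
import random
import zlib


def pack_bits(rows):
    """Pack a binary image (rows of 0/1 ints) into bytes, 8 pixels per byte."""
    bits = [pixel for row in rows for pixel in row]
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)


def compression_ratio(rows):
    """Compressed size divided by raw size (assumed complexity proxy)."""
    raw = pack_bits(rows)
    return len(zlib.compress(raw, 9)) / len(raw)


def mean_row_correlation(rows):
    """Average normalized cross-correlation between vertically adjacent rows."""
    scores = []
    for a, b in zip(rows, rows[1:]):
        mean_a = sum(a) / len(a)
        mean_b = sum(b) / len(b)
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        var_a = sum((x - mean_a) ** 2 for x in a)
        var_b = sum((y - mean_b) ** 2 for y in b)
        if var_a == 0 or var_b == 0:
            continue  # constant row: correlation is undefined, skip the pair
        scores.append(cov / (var_a * var_b) ** 0.5)
    return sum(scores) / len(scores) if scores else 0.0


def classify_block(rows, corr_threshold=0.5, ratio_threshold=0.5):
    """Hypothetical two-stage decision; both thresholds are illustrative."""
    # Stage 1: strongly correlated neighbouring rows are treated as text.
    if mean_row_correlation(rows) >= corr_threshold:
        return "text"
    # Stage 2: otherwise fall back to compressibility.
    if compression_ratio(rows) <= ratio_threshold:
        return "text"
    return "halftone"


# Toy inputs: vertical strokes stand in for text, random dots for a halftone.
stripes = [[1, 1, 0, 0] * 16 for _ in range(64)]
rng = random.Random(0)
noise = [[rng.randint(0, 1) for _ in range(64)] for _ in range(64)]

print(classify_block(stripes), classify_block(noise))
```

Real scanned blocks would need thresholds tuned on labelled data, and a cheap test placed first only pays off if it resolves most blocks before the costlier stages run.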
Pages: 2483-2500
Number of pages: 18