Arabic document layout analysis

Cited by: 0
Authors
Amany M. Hesham
Mohsen A. A. Rashwan
Hassanin M. Al-Barhamtoshy
Sherif M. Abdou
Amr A. Badr
Ibrahim Farag
Affiliations
[1] Cairo University, Department of Computer Science, Faculty of Computers and Information Technology
[2] Cairo University, Department of Electronics and Electrical Communications
[3] The Engineering Company for the Development of Computer Systems; RDI, Faculty of Computing and Information Technology
[4] King Abdulaziz University, Department of Information Technology, Faculty of Computers and Information Technology
[5] Cairo University
Keywords
Layout analysis; Texture features; Connected component; Clustering; Genetic algorithm; Feature selection
DOI
Not available
Abstract
Document layout analysis is a key step in converting document images into text. Arabic script is cursive and is written in different styles, which makes the analysis of Arabic text documents challenging. In this paper, we introduce an approach to Arabic document layout analysis. In this approach, the document is segmented into a set of zones using morphological operations. The segmented zones are classified as text or non-text using a support vector machine classifier. The features used for zone classification combine texture-based features with connected component-based features. The size of the texture-based feature vector is reduced using a genetic algorithm. Classified text zones are clustered into lines using an adaptive sample set clustering algorithm. Each line is then segmented into words by clustering inter- and intra-word spaces. The proposed system was evaluated against two other systems that represent the best available tools for Arabic document analysis. The evaluation results show that the proposed system works well on multi-font and multi-size documents with a variety of layouts, and even on some historical documents.
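The word segmentation step described in the abstract (clustering the spaces within a line) can be illustrated with a minimal sketch. The abstract does not say which clustering method is used for the spaces, so the example below assumes a simple 1-D two-means clustering of gap widths as a stand-in. The function names (`column_gaps`, `two_means_threshold`, `split_words`) and the representation of a line as a binary grid are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def column_gaps(line: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (start, width) for each run of empty columns with ink on both sides."""
    width = len(line[0])
    # A column counts as "ink" if any row has a non-zero pixel in it.
    ink = [any(row[x] for row in line) for x in range(width)]
    gaps = []
    x = 0
    while x < width:
        if not ink[x]:
            start = x
            while x < width and not ink[x]:
                x += 1
            # Skip leading and trailing margins; keep interior gaps only.
            if start > 0 and x < width:
                gaps.append((start, x - start))
        else:
            x += 1
    return gaps


def two_means_threshold(values: List[int], iters: int = 20) -> float:
    """Assumed stand-in: 1-D two-means; returns the midpoint of the two centres."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iters):
        small = [v for v in values if abs(v - lo) <= abs(v - hi)]
        large = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not large:
            break
        lo, hi = sum(small) / len(small), sum(large) / len(large)
    return (lo + hi) / 2


def split_words(line: List[List[int]]) -> List[Tuple[int, int]]:
    """Split a binary text line into word spans (start_col, end_col_exclusive)."""
    width = len(line[0])
    ink_cols = [x for x in range(width) if any(row[x] for row in line)]
    if not ink_cols:
        return []
    first, last = ink_cols[0], ink_cols[-1] + 1
    gaps = column_gaps(line)
    widths = [w for _, w in gaps]
    # With no gaps, or gaps of a single size, two classes cannot be told apart.
    if not widths or min(widths) == max(widths):
        return [(first, last)]
    threshold = two_means_threshold(widths)
    words, start = [], first
    for gap_start, gap_width in gaps:
        if gap_width > threshold:  # wide gap: treated as an inter-word space
            words.append((start, gap_start))
            start = gap_start + gap_width
    words.append((start, last))
    return words


# One-row toy line: narrow gaps inside words, one wide gap between words.
toy_line = [[int(c) for c in "11011000011011"]]
print(split_words(toy_line))
```

The spans are column ranges, so they do not depend on reading direction; for Arabic text the words would simply be read from the rightmost span to the leftmost. A real system would compute gaps from connected components rather than raw column projections, since cursive strokes and diacritics often overlap vertically.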
Pages: 1275-1287
Number of pages: 12
Source
Pattern Analysis and Applications, 2017, 20(04): 1275-1287