An algorithm of line segmentation and reading order sorting based on adjacent character detection: A post-processing of OCR for digitization of Chinese historical texts

Cited by: 2
Authors
Lee, Aram [1]
Yu, Hongyeon [1]
Min, Gihyeon [1]
Affiliation
[1] Honam Res Ctr, Elect & Telecommun Res Inst, Gwangju 61012, South Korea
Keywords
Chinese historical text; Line segmentation; Reading order sorting; Optical character recognition; Digital text conversion; Cultural heritage conservation;
DOI
10.1016/j.culher.2024.02.001
Chinese Library Classification number
K85 [Cultural Relics and Archaeology]
Discipline classification code
0601
Abstract
In recent times, the advent of AI-based optical character recognition (OCR) has garnered significant attention in the realm of digital text conversion. However, OCR identifies only individual characters or words and lacks the ability to reunite them into cohesive units such as words or sentences. Consequently, manually sorting them to establish the appropriate reading order has emerged as a bottleneck. In this paper, we present an algorithm termed adjacent character detection (ACD), designed to serve as a post-processing step for OCR, facilitating automatic digital text conversion. The algorithm performs line segmentation through a quad-ACD scan (up-down-down-up), allowing it to consecutively discern characters within a column based on their adjacency relations. Conventional projection profile analyses have struggled to effectively partition the distinct internal structure of Chinese historical text, where two annotation columns often subdivide from a single body column. In contrast, our ACD algorithm takes the opposite approach, reuniting adjacent characters rather than fragmenting the entire text into isolated entities. Additionally, the ACD algorithm enables body/annotation classification for OCR-detected characters based on the pattern analysis of its quad scan. This cumulative information empowers the conversion of digital text in a desired reading order. To assess the efficacy of the proposed algorithm, a set of ground-truth OCR results was subjected to rigorous testing, culminating in a reading order accuracy of 98.6%. Noteworthy robustness was also demonstrated in the face of misaligned columns, experimentally induced by applying tilt, warp, and wavy noises to the original digital images. Lastly, the algorithm was integrated with two pre-developed OCR models, resulting in a reading order accuracy of 97.7%. © 2024 Consiglio Nazionale delle Ricerche (CNR). Published by Elsevier Masson SAS. All rights reserved.
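The idea of reuniting adjacent characters into columns, rather than cutting the page apart with projection profiles, can be illustrated with a minimal Python sketch. It is not the authors' quad-ACD scan: it performs a single downward pass, ignores the body/annotation split, and assumes vertical text read right to left. The `Box` type, the function names, and the gap and overlap thresholds are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical OCR output: one character with its bounding box."""
    char: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def x_overlap(a: Box, b: Box) -> float:
    """Length of the horizontal overlap between two boxes (0 if none)."""
    return max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))


def find_below(cur: Box, boxes: list, used: set,
               max_gap_ratio: float = 1.0,
               min_overlap_ratio: float = 0.5):
    """Index of the nearest unused box directly below `cur`, or None.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    best = None
    for i, b in enumerate(boxes):
        if i in used:
            continue
        gap = b.y - (cur.y + cur.h)  # vertical distance to the candidate
        if gap < -0.25 * cur.h or gap > max_gap_ratio * cur.h:
            continue  # candidate is above `cur` or too far below it
        if x_overlap(cur, b) < min_overlap_ratio * min(cur.w, b.w):
            continue  # candidate belongs to a different column
        if best is None or b.y < boxes[best].y:
            best = i
    return best


def group_columns(boxes: list) -> list:
    """Chain characters top to bottom into columns by adjacency."""
    order = sorted(range(len(boxes)), key=lambda i: boxes[i].y)
    used, columns = set(), []
    for start in order:
        if start in used:
            continue
        used.add(start)
        chain = [start]
        while True:
            nxt = find_below(boxes[chain[-1]], boxes, used)
            if nxt is None:
                break
            used.add(nxt)
            chain.append(nxt)
        columns.append([boxes[i] for i in chain])
    return columns


def reading_order(boxes: list) -> str:
    """Concatenate columns right to left, each column top to bottom."""
    columns = group_columns(boxes)
    columns.sort(key=lambda col: -max(b.x + b.w for b in col))
    return "".join(b.char for col in columns for b in col)


# Two columns of two characters each, given in shuffled order.
sample = [
    Box("黄", 70, 22, 20, 20),
    Box("天", 100, 0, 20, 20),
    Box("地", 100, 22, 20, 20),
    Box("玄", 70, 0, 20, 20),
]
print(reading_order(sample))  # 天地玄黄
```

Because each character is linked only to its immediate neighbor, a chain can follow a column that drifts sideways, which is the intuition behind the robustness to tilt and warp that the abstract reports; the sketch itself makes no accuracy claim.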
Pages: 80-91
Number of pages: 12
Related papers
2 in total
  • [1] Research on Chinese character recognition post-processing based on genetic algorithm
    Wang, KJ
    Tian, XD
    Guo, BL
    2002 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-4, PROCEEDINGS, 2002: 1718-1721
  • [2] Reading between the Lines: Image-Based Order Detection in OCR for Chinese Historical Documents
    Ma, Hsing-Yuan
    Huang, Hen-Hsen
    Liu, Chao-Lin
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024: 23808-23810