A Fast Appearance-Based Full-Text Search Method for Historical Newspaper Images

Cited by: 5
Authors
Terasawa, Kengo [1]
Shima, Takahiro [1]
Kawashima, Toshio [1]
Affiliations
[1] Future Univ Hakodate, Grad Sch Syst Informat Sci, Hakodate, Hokkaido 0418655, Japan
Keywords
string matching; word spotting; historical document images; Locality-Sensitive Pseudo-Code; Boyer-Moore-Horspool algorithm;
DOI
10.1109/ICDAR.2011.277
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents a fast appearance-based full-text search method for historical newspaper images. Because historical newspapers differ from recent ones in image quality, typefaces, and language usage, optical character recognition (OCR) does not provide sufficient quality. Instead of an OCR approach, we adopted an appearance-based approach; that is, we matched characters to one another by their shapes. Assuming proper character segmentation and proper feature description, the full-text search problem reduces to a sequence matching problem over feature vectors. To increase computational efficiency, we adopted a pseudo-code expression called LSPC (Locality-Sensitive Pseudo-Code), which is a compact sketch of a feature vector that retains a good deal of its information. Experimental results showed that our method can retrieve a query string from a text of over eight million characters within a second. In addition, we expect that more sophisticated algorithms can be designed for LSPC. As an example, we established the Extended Boyer-Moore-Horspool algorithm, which further reduces the computational cost, especially as the query string becomes longer.
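The abstract gives no details of LSPC or of the Extended Boyer-Moore-Horspool algorithm, so the following is only a minimal sketch of the standard Boyer-Moore-Horspool search that the extension builds on. It assumes each character image has already been reduced to a single integer code (a hypothetical stand-in for a pseudo-code) and that codes are compared for exact equality; the function name `horspool_search` and the integer encoding are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Sequence


def horspool_search(text: Sequence[int], pattern: Sequence[int]) -> List[int]:
    """Return every start index where `pattern` occurs in `text`.

    Standard Boyer-Moore-Horspool over integer codes (hypothetical
    stand-ins for per-character pseudo-codes), with exact matching only.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Bad-character table: for each code in the pattern except the last
    # position, the distance from its rightmost occurrence to the end.
    shift: Dict[int, int] = {
        code: m - 1 - i for i, code in enumerate(pattern[:-1])
    }

    hits: List[int] = []
    pos = 0
    while pos <= n - m:
        # Compare the current window against the pattern, right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Shift by the table entry for the code under the window's last
        # position; codes absent from the table allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return hits


if __name__ == "__main__":
    codes = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 1, 4, 1, 5]
    print(horspool_search(codes, [1, 4, 1, 5]))
```

Codes missing from the shift table let the window jump by the full pattern length, so longer queries permit larger skips. That is consistent with the abstract's remark that the savings grow as the query string becomes longer, although the extended variant described there may differ substantially from this baseline.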
Pages: 1379-1383
Page count: 5