Development and customization of in-house developed OCR and its evaluation

Cited by: 0
Authors
Rajeswari, S. [1 ]
Magapu, Sai Baba [2 ]
Affiliations
[1] Indira Gandhi Ctr Atom Res, Homi Bhabha Natl Inst, Kalpakkam, Tamil Nadu, India
[2] Natl Inst Adv Studies, Dept Nat Sci & Engn, Bangalore, Karnataka, India
Source
ELECTRONIC LIBRARY | 2018 / Vol. 36 / Issue 05
Keywords
Key phrases; Key words; Optical character recognition; Skew detection and correction; Stemming; Stop words;
DOI
10.1108/EL-01-2018-0011
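Chinese Library Classification
G25 [Library science, librarianship]; G35 [Information science, information work];
Discipline classification codes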
1205 ; 120501 ;
Abstract
Purpose: This paper develops a text extraction tool for scanned documents that extracts the text and builds a keyword corpus and a key-phrase corpus for each document without manual intervention.

Design/methodology/approach: A Web-based optical character recognition (OCR) tool was developed to extract text from scanned documents. Because OCR is a well-established technology, the tool was built on Microsoft Office document imaging tools. Scanning commonly introduces skew, so a method to detect and correct skew in scanned documents was developed and integrated with the tool. The OCR tool was customized to build a keyword corpus and a key-phrase corpus for every document.

Findings: The tool was evaluated on a 100-document corpus to test the various properties of OCR. It achieved above 99 per cent word read accuracy for text-only image documents. The customization was tested with samples of microfiches, journal pages from back volumes and newspaper clips, and the results are discussed in the summary. The tool was found to be useful for text extraction and processing.

Social implications: Scanned documents are converted into keyword and key-phrase corpora. The tool could be used to build metadata for scanned documents without manual intervention.

Originality/value: The tool converts unstructured data (image documents) into structured data (a database of keywords and key phrases for each document). In addition, each image document is converted into an editable and searchable document.
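The abstract does not say how the keyword and key-phrase corpora are built; the record's keywords only mention stop words and stemming. The following is a minimal Python sketch of that general idea, not the authors' implementation. The stop-word list, the suffix-stripping `crude_stem` helper, and the choice of adjacent word pairs as "key phrases" are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption, not from the paper).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for",
              "was", "were", "with", "that", "it", "be", "on", "as"}


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lower-case the OCR output and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


def keywords(text):
    """Count stemmed tokens after dropping stop words."""
    return Counter(crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS)


def key_phrases(text):
    """Count adjacent word pairs in which neither word is a stop word."""
    tokens = tokenize(text)
    return Counter(
        f"{first} {second}"
        for first, second in zip(tokens, tokens[1:])
        if first not in STOP_WORDS and second not in STOP_WORDS
    )
```

Calling `keywords(ocr_text).most_common(10)` on the text extracted from one scanned page would give a candidate keyword list for that page. Because punctuation is discarded, this sketch can pair words across sentence boundaries; a production tool would likely split sentences first.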
Pages: 766 / 781
Page count: 16