Development and customization of in-house developed OCR and its evaluation

Authors
Rajeswari, S. [1 ]
Magapu, Sai Baba [2 ]
Affiliations
[1] Indira Gandhi Ctr Atom Res, Homi Bhabha Natl Inst, Kalpakkam, Tamil Nadu, India
[2] Natl Inst Adv Studies, Dept Nat Sci & Engn, Bangalore, Karnataka, India
Source
ELECTRONIC LIBRARY | 2018 / Vol. 36 / Issue 05
Keywords
Key phrases; Key words; Optical character recognition; Skew detection and correction; Stemming; Stop words
DOI
10.1108/EL-01-2018-0011
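Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification code
1205; 120501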
Abstract
Purpose: The purpose of this paper is to develop a text extraction tool for scanned documents that extracts the text and builds a keyword corpus and a key-phrase corpus for each document without manual intervention.
Design/methodology/approach: For text extraction from scanned documents, a Web-based optical character recognition (OCR) tool was developed. Because OCR is a well-established technology, the tool was built on Microsoft Office document imaging tools. Scanning commonly introduces skew, so a method to detect and correct skew in the scanned documents was developed and integrated with the tool. The OCR tool was customized to build a keyword corpus and a key-phrase corpus for every document.
Findings: The developed tool was evaluated using a 100-document corpus to test the various properties of OCR. The tool had above 99 per cent word read accuracy for text-only image documents. The customization of the OCR was tested with samples of microfiches, journal pages from back volumes and newspaper clips, and the results are discussed in the summary. The tool was found to be useful for text extraction and processing.
Social implications: The scanned documents are converted to keyword and key-phrase corpora. The tool could be used to build metadata for scanned documents without manual intervention.
Originality/value: The tool is used to convert unstructured data (in the form of image documents) to structured data (the document is converted into a keyword and key-phrase database). In addition, the image document is converted to an editable and searchable document.
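The keyword-corpus step mentioned in the abstract (stop-word removal followed by stemming) can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract says was built on Microsoft Office document imaging tools; the stop-word list, the suffix list and the names `crude_stem` and `build_keyword_corpus` are all assumptions made for illustration, and the stemmer is a crude suffix stripper rather than a published stemming algorithm.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list (assumption, not from the paper).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "for", "from", "in",
    "is", "of", "the", "to", "was", "with",
}

# Hypothetical suffix list for a crude stemmer (assumption, not from the paper).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_keyword_corpus(text):
    """Return stem frequencies for OCR-extracted text, skipping stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)


if __name__ == "__main__":
    sample = "The scanned documents are scanned and the document is converted"
    for stem, count in build_keyword_corpus(sample).most_common():
        print(stem, count)
```

Counting stems rather than raw tokens lets inflected forms such as "documents" and "document" contribute to the same keyword entry, which is the usual motivation for stemming before building a corpus.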
Pages: 766-781
Page count: 16