Development and customization of in-house developed OCR and its evaluation

Authors
Rajeswari, S. [1]
Magapu, Sai Baba [2]
Affiliations
[1] Indira Gandhi Ctr Atom Res, Homi Bhabha Natl Inst, Kalpakkam, Tamil Nadu, India
[2] Natl Inst Adv Studies, Dept Nat Sci & Engn, Bangalore, Karnataka, India
Source
ELECTRONIC LIBRARY | 2018 / Vol. 36 / No. 05
Keywords
Key phrases; Key words; Optical character recognition; Skew detection and correction; Stemming; Stop words;
DOI
10.1108/EL-01-2018-0011
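Chinese Library Classification number
G25 [Library science, librarianship]; G35 [Information science, information work];
Subject classification code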
1205 ; 120501 ;
Abstract
Purpose: The purpose of this paper is to develop a text extraction tool for scanned documents that extracts the text and builds a keyword corpus and a key phrase corpus for each document without manual intervention.
Design/methodology/approach: For text extraction from scanned documents, a Web-based optical character recognition (OCR) tool was developed. Because OCR is a well-established technology, the tool was built using Microsoft Office document imaging tools. To account for the commonly encountered problem of skew in scanned documents, a method to detect and correct the skew was developed and integrated with the tool. The OCR tool was customized to build a keyword corpus and a key phrase corpus for every document.
Findings: The developed tool was evaluated using a 100-document corpus to test the various properties of OCR. The tool achieved above 99 per cent word read accuracy for text-only image documents. The customization of the OCR was tested with samples of microfiches, journal pages from back volumes and newspaper clippings, and the results are discussed in the summary. The tool was found to be useful for text extraction and processing.
Social implications: The scanned documents are converted to keyword and key phrase corpora. The tool could be used to build metadata for scanned documents without manual intervention.
Originality/value: The tool converts unstructured data (image documents) to structured data (a database of keywords and key phrases for each document). In addition, the image document is converted to an editable and searchable document.
Pages: 766-781
Page count: 16