Development and customization of in-house developed OCR and its evaluation

Authors
Rajeswari, S. [1]
Magapu, Sai Baba [2]
Affiliations
[1] Indira Gandhi Ctr Atom Res, Homi Bhabha Natl Inst, Kalpakkam, Tamil Nadu, India
[2] Natl Inst Adv Studies, Dept Nat Sci & Engn, Bangalore, Karnataka, India
Source
ELECTRONIC LIBRARY | 2018 / Vol. 36 / Issue 05
Keywords
Key phrases; Key words; Optical character recognition; Skew detection and correction; Stemming; Stop words;
DOI
10.1108/EL-01-2018-0011
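Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work];
Discipline classification code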
1205 ; 120501 ;
Abstract
Purpose: The purpose of this paper is to develop a text extraction tool for scanned documents that extracts text and builds a keyword corpus and a key phrase corpus for each document without manual intervention.
Design/methodology/approach: For text extraction from scanned documents, a Web-based optical character recognition (OCR) tool was developed. Because OCR is a well-established technology, the tool was built using Microsoft Office document imaging tools. To account for the commonly encountered problem of skew, a method to detect and correct the skew introduced in scanned documents was developed and integrated with the tool. The OCR tool was customized to build a keyword corpus and a key phrase corpus for every document.
Findings: The developed tool was evaluated using a 100-document corpus to test the various properties of OCR. The tool had above 99 per cent word read accuracy for text-only image documents. The customization of the OCR was tested with samples of microfiches, journal pages from back volumes and newspaper clips, and the results are discussed in the summary. The tool was found to be useful for text extraction and processing.
Social implications: The scanned documents are converted to a keyword corpus and a key phrase corpus. The tool could be used to build metadata for scanned documents without manual intervention.
Originality/value: The tool converts unstructured data (image documents) to structured data (a database of keywords and key phrases). In addition, the image document is converted to an editable and searchable document.
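The abstract does not say how the keyword corpus is built, but the record's keyword list names stop words and stemming. The following is a minimal Python sketch of that general idea, not the authors' implementation: the stop-word list, the suffix rules in `crude_stem`, and the function names are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "was", "with"}


def crude_stem(word):
    """Strip a few common English suffixes (illustrative only, not Porter stemming)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent stemmed, non-stop-word tokens with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems).most_common(top_n)


# Stand-in for text produced by an OCR step.
sample = "The scanned documents are converted. Scanning a document extracts text."
keywords = extract_keywords(sample, top_n=3)
print(keywords)
```

In a pipeline like the one described, `sample` would be the OCR output for one document, and the resulting stems and counts would be stored as that document's keyword entries. A production system would likely swap in a full stop-word list and an established stemmer.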
Pages: 766-781
Page count: 16