The use of Optical Character Recognition (OCR) in the digitisation of herbarium specimen labels

Cited by: 17
Authors
Drinkwater, Robyn E. [1 ]
Cubey, Robert W. N. [1 ]
Haston, Elspeth M. [1 ]
Affiliation
[1] Royal Botanic Garden Edinburgh, Edinburgh EH3 5LR, Midlothian, Scotland
Funding
The Andrew W. Mellon Foundation (USA)
Keywords
OCR; Digitisation; Data entry; Specimen; Label; Herbarium; BIOLOGICAL COLLECTIONS; WORKFLOWS;
DOI
10.3897/phytokeys.38.7168
Chinese Library Classification
Q94 [Botany]
Discipline Code
071001
Abstract
At the Royal Botanic Garden Edinburgh (RBGE) the use of Optical Character Recognition (OCR) to aid the digitisation process has been investigated. This was tested using a herbarium specimen digitisation process with two stages of data entry. Records were initially batch-processed to add data extracted from the OCR text before being sorted by Collector and/or Country. Using images of the specimens, a team of six digitisers then added data to the specimen records. To investigate whether the OCR-derived data aid the digitisation process, the team completed a series of trials comparing the efficiency of data entry between sorted and unsorted batches of specimens. A survey was carried out to explore the opinions of the digitisation staff on the different sorting options. In total 7,200 specimens were processed. Compared to an unsorted, random set of specimens, those sorted using the data added from the OCR were quicker to digitise. Of the methods tested here, the most efficient used a protocol that required entering data into a limited set of fields, with the records filtered by Collector and Country. The survey and subsequent discussions with the digitisation staff highlighted their preference for working with sorted specimens, in which label layout, locations and handwriting are likely to be similar, so that familiarity with the Collector or Country is rapidly established.
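The sorting step the abstract describes can be sketched as follows. This is a minimal illustration, not RBGE's actual pipeline: it assumes hypothetical specimen IDs, raw OCR strings, and known-term lists, and simply groups records by the (Collector, Country) pair matched in the OCR text so that similar labels land in the same batch.

```python
# Hedged sketch of OCR-based batch sorting: match known Collector and
# Country terms in raw OCR text, then group specimen records by that key.
from collections import defaultdict

# Hypothetical example data: (specimen_id, raw OCR text from the label)
records = [
    ("E001", "Coll. J. Smith ... PERU, Cuzco ..."),
    ("E002", "Leg. A. Jones ... NEPAL, Kathmandu ..."),
    ("E003", "Coll. J. Smith ... PERU, Lima ..."),
]

# Assumed controlled vocabularies (in practice, authority lists)
KNOWN_COLLECTORS = ["J. Smith", "A. Jones"]
KNOWN_COUNTRIES = ["PERU", "NEPAL"]

def extract(ocr_text, vocabulary):
    """Return the first known term found in the OCR text, else None."""
    for term in vocabulary:
        if term in ocr_text:
            return term
    return None

# Sort records into batches keyed by (Collector, Country)
batches = defaultdict(list)
for specimen_id, ocr_text in records:
    key = (extract(ocr_text, KNOWN_COLLECTORS),
           extract(ocr_text, KNOWN_COUNTRIES))
    batches[key].append(specimen_id)

for key, ids in sorted(batches.items()):
    print(key, ids)
```

In this toy run, the two "J. Smith / PERU" specimens end up in one batch and the "A. Jones / NEPAL" specimen in another, which is the property the paper exploits: digitisers working through a batch see similar handwriting, label layout and locations.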
Pages: 15-30
Page count: 16