Digitization Project for Caribbean Plants at New York Botanical Garden
The Caribbean Plants Digitization Project at the New York Botanical Garden aims to image and catalog over 150,000 specimens from the Caribbean region. The workflow covers curation, barcoding, cataloging, and imaging of specimens, with OCR and data parsing used to help capture specimen data. Expeditions sponsored by the Garden since 1895 have contributed to the collections, and approximately 75% of the specimen data could be digitized from field books or published itineraries. The presentation shows sample ideal and actual fieldbooks and explains how OCR is used to attach fieldbook records to specimens.
Presentation Transcript
OCR implementation in the Caribbean Plants Digitization Project: a project to image and catalog over 150,000 Caribbean specimens at the New York Botanical Garden. Presented by: Stephen Gottschalk, The New York Botanical Garden. [Map legend: estimated number of specimens per country.]
NYBG's Caribbean Collections: More than 100 expeditions sponsored by the Garden since 1895. Notable and prolific collections by current and former Garden staff, including the Garden's founder, Nathaniel Lord Britton. Approximately 75% of the specimen data could be digitized from field books at NYBG and other institutions, or from published itineraries, which provide the same information.
Caribbean Project workflow summary: curation and rapid barcoding of specimens; field book entries; Optical Character Recognition (OCR) and data parsing; specimen catalog record; specimen imaging; manual keying of specimen data.
Sample ideal fieldbook: determination; plant family; collection locality; collection date; no. of duplicates; collection no.; plant description; habitat.
Sample fieldbook - the product:
Sample Caribbean fieldbooks, less than ideal: Vol. 132, J. A. Shafer, 1909; Vol. 69, Van Hermann, 1904.
OCR assists in attaching fieldbook records: [Screenshot labels: user input; OCR-derived fields; IRN; fieldbook entries.]
Using OCR to populate fields: user detects pattern to update fields; Python script finds line of query term; query raw OCR to extract records of a given label type. Example: SELECT * FROM OCR_all where label like "*New*Yor*Bot*Gar*Exp*Cub*";
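The query step can be tried out with a small, self-contained sketch. The `*` wildcards on the slide suggest Microsoft Access-style syntax; the version below assumes SQLite instead, where `LIKE` uses `%` as the wildcard. The table name `OCR_all` and the column `label` come from the slide, while the `irn` column, the `find_labels` helper, and the sample rows are made up for illustration.

```python
import sqlite3

def find_labels(rows, pattern):
    """Return (irn, label) rows whose raw OCR text matches a LIKE pattern."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: only the table name and `label` appear on the slide.
    conn.execute("CREATE TABLE OCR_all (irn INTEGER, label TEXT)")
    conn.executemany("INSERT INTO OCR_all VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT irn, label FROM OCR_all WHERE label LIKE ? ORDER BY irn",
        (pattern,),
    ).fetchall()
    conn.close()
    return result

# Invented OCR output, including typical misreads ("Yorx", "Botanlcal").
sample_rows = [
    (1, "NEW YORK BOTANICAL GARDEN Exploration of Cuba Coll. J. A. Shafer"),
    (2, "Plantae Cubenses Wrightianae"),
    (3, "New Yorx Botanlcal Garden Expl. of Cuba"),
]

# Short fragments separated by wildcards tolerate OCR errors between them.
matches = find_labels(sample_rows, "%New%Yor%Bot%Gar%Exp%Cub%")
for irn, label in matches:
    print(irn, label)
```

Matching on short fragments rather than the full heading is what lets the query survive misread characters, at the cost of occasionally catching unrelated labels.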
Using OCR to populate fields: user detects pattern to update fields; Python script finds line of query term; query raw OCR to extract records of a given label type. Example: return line containing "Col".
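A minimal sketch of the "find line of query term" step follows. The presentation does not show the actual script, so the function name `find_line`, the case-insensitive matching, and the sample label text are all assumptions.

```python
def find_line(ocr_text, term):
    """Return the first line of ocr_text containing term, or None if absent.

    Matching is case-insensitive (an assumption, not stated in the slides).
    """
    needle = term.lower()
    for line in ocr_text.splitlines():
        if needle in line.lower():
            return line.strip()
    return None

# Invented raw OCR text for one specimen label.
label = (
    "NEW YORK BOTANICAL GARDEN\n"
    "Exploration of Cuba\n"
    "Coll. J. A. Shafer  No. 1234\n"
    "March 1909"
)

collector_line = find_line(label, "Col")
print(collector_line)
```

Returning the whole line, rather than only the matched term, leaves the surrounding text available for the user to inspect when deciding which fields to update.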
Using OCR to populate fields: user detects pattern to update fields; Python script finds line of query term; query raw OCR to extract records of a given label type. Example: length of string; find position of "j. a."; find "sha"; find "afer". J. A. Shafer collections!
Avoid false positives: F. S. Earle, no!
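The checks named on the two slides above (string length, position of "j. a.", and the fragments "sha" and "afer") could be combined along the following lines. How the original workflow combined them is not stated, so the function name `looks_like_shafer`, the length threshold, and the rule that a surname fragment must follow the initials are assumptions.

```python
def looks_like_shafer(line, max_length=60):
    """Heuristic check that an OCR'd collector line names J. A. Shafer."""
    text = line.lower()
    # Assumed threshold: very long lines are unlikely to be a collector line.
    if len(text) > max_length:
        return False
    # The initials must be present; this is what rejects "F. S. Earle".
    pos = text.find("j. a.")
    if pos == -1:
        return False
    # Accept either surname fragment after the initials, so a single
    # misread character (e.g. "Sbafer" or "Shaler") still matches.
    rest = text[pos + len("j. a."):]
    return "sha" in rest or "afer" in rest

# Invented OCR lines, two of them with misread characters.
for candidate in ["Coll. J. A. Shafer", "Coll. J. A. Sbafer",
                  "Coll. J. A. Shaler", "Coll. F. S. Earle"]:
    print(candidate, looks_like_shafer(candidate))
```

Requiring the initials as well as a surname fragment trades a little recall for fewer false positives, which matters when the matched records are updated in bulk.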
Consider pattern training and a second OCR pass: Wright labels, 162 total, generally low quality. [Chart: percentage correctly OCR'd (0% to 100%) by OCR pattern training used (built-in; trained once; trained mult.; trained other; trained both; trained mult. with updated dictionary). Series: "Plant "; "Cubenses"; "Wrightian "; full string.]
Consider pattern training and a second OCR pass: Zanoni labels, 114 total, generally typed. [Chart: percentage correctly OCR'd (0% to 100%) by OCR pattern training used (built-in; trained once; trained mult.; trained other; trained both). Series: "Moscoso"; "Rafael"; "Zanoni"; full heading: Jardin Botanico Nacional "Dr. Rafael M. Moscoso"; heading with " and . punctuation stripped: Jardin Botanico Nacional Dr Rafael M Moscoso.]
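The kind of measurement plotted in these charts, the share of labels in which a target string was read correctly, can be sketched as below, along with the effect of stripping punctuation before comparing. The function names and the four sample OCR readings are invented for illustration and do not reproduce the presentation's data.

```python
import string

def strip_punctuation(text):
    """Remove ASCII punctuation such as quotation marks and periods."""
    return text.translate(str.maketrans("", "", string.punctuation))

def percent_containing(ocr_texts, target):
    """Percentage of OCR texts that contain the target string exactly."""
    if not ocr_texts:
        return 0.0
    hits = sum(1 for text in ocr_texts if target in text)
    return 100.0 * hits / len(ocr_texts)

heading = 'Jardin Botanico Nacional "Dr. Rafael M. Moscoso"'

# Invented OCR readings of the same heading, some with typical errors.
ocr_readings = [
    'Jardin Botanico Nacional "Dr. Rafael M. Moscoso"',
    'Jardin Botanico Nacional "Dr Rafael M. Moscoso',
    'Jardln Botanico Nacional Dr. Rafae1 M. Moscoso',
    'Jardin Botanico Nacional "Dr. Rafael M, Moscoso"',
]

full_score = percent_containing(ocr_readings, heading)
stripped_score = percent_containing(
    [strip_punctuation(text) for text in ocr_readings],
    strip_punctuation(heading),
)
print(full_score, stripped_score)
```

In this toy data, punctuation errors account for most of the failures on the full heading, so stripping punctuation from both sides before comparing raises the match rate without retraining the OCR engine.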
Closing thoughts: OCR plus human parsing works well with very little programming. It works well for large, self-contained data sets but maybe not for partial or changing data sets; automation would be helpful for addressing this. It allows for the creation of digital fieldbooks (i.e., records ordered by collector, collection number, and place).
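As a small illustration of the "digital fieldbook" idea, parsed records can be ordered by collector, collection number, and place. The record structure and values below are hypothetical; real collection numbers may contain letters or suffixes and would then need a more careful sort key.

```python
from operator import itemgetter

# Hypothetical parsed records; field names and values are invented.
records = [
    {"collector": "Shafer", "number": 1310, "place": "Cuba"},
    {"collector": "Britton", "number": 52, "place": "Puerto Rico"},
    {"collector": "Shafer", "number": 1201, "place": "Cuba"},
]

# Sort by collector first, then collection number, then place.
fieldbook = sorted(records, key=itemgetter("collector", "number", "place"))

for entry in fieldbook:
    print(entry["collector"], entry["number"], entry["place"])
```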
Acknowledgements: National Science Foundation; Barbara Thiers, Jacquelyn Kallunki, Michael Bevans, Anthony Kirchgessner, Melissa Tulig, Benito Santos, Nicole Tarnowsky, Tom Zanoni, Benjamin Saracco, Stephen Sinon, Vinson Doyle, Jessica Allen, Sarah Dutton, Lane Gibbons, Elizabeth Kiernan, Brandy Watts, Charles Zimmerman. Visit the Virtual Herbarium: http://sciweb.nybg.org/science2/vii2.asp