Digitization Project for Caribbean Plants at New York Botanical Garden

The Caribbean Plants Digitization Project at New York Botanical Garden aims to image and catalog over 150,000 specimens from the Caribbean region. Through OCR and data parsing, the project focuses on curation, barcoding, cataloging, and imaging of specimens. Expeditions since 1895 have contributed to the extensive collections, with a vast amount of specimen data being digitized for accessibility. Sample ideal and actual fieldbooks are showcased, with insights on using OCR to attach fieldbook records.


Uploaded on Sep 27, 2024


Presentation Transcript


  1. OCR implementation in the Caribbean Plants Digitization Project: a project to image and catalog over 150,000 Caribbean specimens at the New York Botanical Garden. Presented by: Stephen Gottschalk, The New York Botanical Garden. (Map legend: estimated number of specimens per country.)

  2. NYBG's Caribbean Collections: More than 100 expeditions sponsored by the Garden since 1895. Notable and prolific collections by current and former Garden staff, including the Garden's founder, Nathaniel Lord Britton. Approximately 75% of the specimen data could be digitized from field books at NYBG and other institutions, or from published itineraries, which provide the same information.

  3. Caribbean Project workflow summary: curation and rapid barcoding of specimens; field book entries; Optical Character Recognition (OCR) and data parsing; specimen catalog record; specimen imaging; manual keying of specimen data.

  4. Sample ideal fieldbook, with labeled fields: determination, plant family, collection locality, collection date, no. of duplicates, collection no., plant description, habitat.

  5. Sample fieldbook - the product:

  6. Sample Caribbean fieldbooks, less than ideal: Vol. 132, J. A. Shafer, 1909; Vol. 69, Van Hermann, 1904.

  7. OCR assists in attaching fieldbook records (figure labels: user input, OCR-derived fields, IRN, fieldbook entries).

  8. Using OCR to populate fields: User detects pattern to update fields. Python script finds line of query term. Query raw OCR to extract records of a given label type. Example: SELECT * FROM OCR_all WHERE label LIKE "*New*Yor*Bot*Gar*Exp*Cub*";
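
The query on this slide can be tried out with a minimal sketch, assuming the raw OCR text sits in a table with one row per label. Only the table name OCR_all and the column label come from the slide; the irn column, the sample rows, and the function name are hypothetical. The slide's `*` wildcards are written here as `%`, the wildcard that SQLite's LIKE operator uses.

```python
import sqlite3

# Hypothetical OCR output, invented for illustration (not project data).
SAMPLE_ROWS = [
    (1, "NEW YORK BOTANICAL GARDEN Exploration of Cuba Col. J. A. Shafer"),
    (2, "New Yorlc Botanieal Garden Exploration of Cuba"),  # simulated OCR errors
    (3, "Jardin Botanico Nacional Dr. Rafael M. Moscoso"),
]

# Short word fragments joined by wildcards tolerate OCR errors elsewhere in each word.
PATTERN = "%New%Yor%Bot%Gar%Exp%Cub%"


def find_labels(rows, like_pattern):
    """Return (irn, label) rows whose OCR text matches a SQL LIKE pattern."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE OCR_all (irn INTEGER, label TEXT)")
    conn.executemany("INSERT INTO OCR_all VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT irn, label FROM OCR_all WHERE label LIKE ? ORDER BY irn",
        (like_pattern,),
    )
    matches = cursor.fetchall()
    conn.close()
    return matches


if __name__ == "__main__":
    for irn, label in find_labels(SAMPLE_ROWS, PATTERN):
        print(irn, label)
```

Because the pattern keeps only the first few letters of each word, a label with misread characters later in a word (row 2 above) is still returned, while an unrelated heading (row 3) is not.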

  9. Using OCR to populate fields: User detects pattern to update fields. Python script finds line of query term. Query raw OCR to extract records of a given label type. Example: return the line containing "Col".
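
A minimal sketch of the "find line of query term" step, assuming each label's raw OCR is a multi-line string. The function name and the sample label are hypothetical; the project's actual script is not shown in the slides.

```python
def line_containing(ocr_text, term):
    """Return the first line of ocr_text that contains term, or None if no line does."""
    for line in ocr_text.splitlines():
        if term in line:  # case-sensitive, so "Col" does not match "col"
            return line.strip()
    return None


# Hypothetical OCR text for one label, invented for illustration.
SAMPLE_LABEL = """NEW YORK BOTANICAL GARDEN
Exploration of Cuba
Col. J. A. Shafer No. 1234
"""

if __name__ == "__main__":
    print(line_containing(SAMPLE_LABEL, "Col"))
```

The returned line can then be parsed further, for example to pull out the collector name or collection number.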

  10. Using OCR to populate fields: User detects pattern to update fields. Python script finds line of query term. Query raw OCR to extract records of a given label type. Example: take the length of the string, find the position of "j. a.", find "sha", find "afer": J. A. Shafer collections!
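
The fragment checks listed on this slide might look like the following sketch. It assumes the collector line has already been extracted, and it accepts a line only when the initials "j. a." are followed by either surname fragment, so a single misread letter in "Shafer" does not break the match. The function names and test strings are hypothetical.

```python
def fragment_positions(line):
    """Report the string length and the position of each fragment (-1 if absent)."""
    text = line.lower()
    return {
        "length": len(text),
        "j. a.": text.find("j. a."),
        "sha": text.find("sha"),
        "afer": text.find("afer"),
    }


def looks_like_shafer(line):
    """True if 'j. a.' occurs and 'sha' or 'afer' occurs somewhere after it."""
    text = line.lower()
    start = text.find("j. a.")
    if start == -1:
        return False
    tail_from = start + len("j. a.")
    # Search only the text after the initials for either surname fragment.
    return text.find("sha", tail_from) != -1 or text.find("afer", tail_from) != -1


if __name__ == "__main__":
    for candidate in ("Col. J. A. Shafer", "Col. J. A. Sbafer", "Col. F. S. Earle"):
        print(candidate, looks_like_shafer(candidate))
```

Requiring both the initials and a surname fragment is one way to keep a line such as "Col. F. S. Earle" from being attributed to Shafer.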

  11. Avoid false positives: F. S. Earle, no!

  12. Consider pattern training and a second OCR pass: Wright labels, 162 total, generally low quality. [Chart: percentage correctly OCR'd (0% to 100%) by OCR pattern training used (built-in; trained once; trained mult.; trained other; trained both; trained mult. with updated dictionary). Series: "Plantæ", "Cubenses", "Wrightianæ", full string.]

  13. Consider pattern training and a second OCR pass: Zanoni labels, 114 total, generally typed. [Chart: percentage correctly OCR'd (0% to 100%) by OCR pattern training used (built-in; trained once; trained mult.; trained other; trained both). Series: "Moscoso"; "Rafael"; "Zanoni"; full heading: Jardin Botanico Nacional "Dr. Rafael M. Moscoso"; heading with " and . punctuation stripped: Jardin Botanico Nacional Dr Rafael M Moscoso.]
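
The comparison between the full heading and the punctuation-stripped heading can be illustrated with a short sketch. It assumes "correctly OCR'd" means the target string appears verbatim in a label's OCR text; that definition, the function names, and the four sample labels are assumptions made for illustration and are not the project's data or results.

```python
def strip_punct(text, chars='".'):
    """Remove double quotes and periods, the punctuation named on the slide."""
    return text.translate(str.maketrans("", "", chars))


def percent_recognized(ocr_labels, target, strip=False):
    """Percentage of labels whose OCR text contains target, optionally ignoring punctuation."""
    if not ocr_labels:
        return 0.0
    if strip:
        target = strip_punct(target)
        ocr_labels = [strip_punct(label) for label in ocr_labels]
    hits = sum(1 for label in ocr_labels if target in label)
    return 100.0 * hits / len(ocr_labels)


HEADING = 'Jardin Botanico Nacional "Dr. Rafael M. Moscoso"'

# Hypothetical OCR results, invented for illustration.
SAMPLE_OCR = [
    'Jardin Botanico Nacional "Dr. Rafael M. Moscoso"',  # read perfectly
    'Jardin Botanico Nacional Dr Rafael M. Moscoso',     # quotes and a period lost
    'Jardin Botanico Nacional "Dr. Rafael M Moscoso',    # a period and a quote lost
    'Jardin Botanlco Nacional "Dr. Rafael M. Moscoso"',  # a letter misread
]

if __name__ == "__main__":
    print(percent_recognized(SAMPLE_OCR, HEADING))
    print(percent_recognized(SAMPLE_OCR, HEADING, strip=True))
```

In this made-up sample, stripping punctuation recovers the two labels whose only errors were dropped quotes or periods, while the label with a misread letter still fails to match.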

  14. Closing thoughts: OCR plus human parsing works well with very little programming. It works well for large, self-contained data sets but maybe not for partial or changing data sets; automation would be helpful for addressing this. It allows for the creation of digital fieldbooks (i.e., ordered by collector, collection number, and place).
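
The "digital fieldbook" ordering mentioned in the closing thoughts amounts to a multi-key sort. A minimal sketch, assuming each parsed record is a dictionary with collector, number, and place keys; the field names and the sample records are hypothetical.

```python
def digital_fieldbook(records):
    """Order records by collector, then collection number, then place."""
    return sorted(records, key=lambda r: (r["collector"], r["number"], r["place"]))


# Hypothetical parsed records, invented for illustration.
SAMPLE_RECORDS = [
    {"collector": "J. A. Shafer", "number": 1240, "place": "Cuba"},
    {"collector": "F. S. Earle", "number": 88, "place": "Cuba"},
    {"collector": "J. A. Shafer", "number": 1234, "place": "Cuba"},
]

if __name__ == "__main__":
    for record in digital_fieldbook(SAMPLE_RECORDS):
        print(record["collector"], record["number"], record["place"])
```

Storing the collection number as an integer keeps the order numeric, so 88 sorts before 1234 rather than after it as it would in a string sort.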

  15. Acknowledgements: National Science Foundation; Barbara Thiers, Jacquelyn Kallunki, Michael Bevans, Anthony Kirchgessner, Melissa Tulig, Benito Santos, Nicole Tarnowsky, Tom Zanoni, Benjamin Saracco, Stephen Sinon, Vinson Doyle, Jessica Allen, Sarah Dutton, Lane Gibbons, Elizabeth Kiernan, Brandy Watts, Charles Zimmerman. Visit the Virtual Herbarium: http://sciweb.nybg.org/science2/vii2.asp
