Managing Research Data Repositories for OCR-D


Research data repositories play a crucial role in the OCR-D framework, storing and managing data from document analysis processes. These repositories, like the Ground Truth (GT) repository, support FAIR principles by organizing findable, accessible, and retrievable data with metadata and provenance information. Ingestion processes involve uploading BagIt containers, extracting metadata, and indexing content for efficient retrieval. The framework also includes supported formats and container profiles to ensure data integrity and interoperability.


Uploaded on Oct 05, 2024



Presentation Transcript


  1. Repositories for OCR-D. Volker Hartmann (volker.hartmann@kit.edu), 27.02.2019

  2. Repository. A repository is a central location in which data is stored and managed. Types: software repository (GitHub, GitLab); data repository.

  3. Glossary OCR-D. Research data repository: may contain the results of all steps during document analysis. At a minimum, it contains the end results of every processed document and its full provenance. It does not need to be publicly available. Ground Truth (GT) data repository: a research data repository that is publicly available. It contains all the ground truth meta-/data. (Available at: https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit)

  4. Technical Data. Supported format: container format BagIt, version 0.97+; profile: https://ocr-d.github.io/bagit-profile.json. Protocol: HTTP (REST). Link: https://github.com/OCR-D/repository_metastore
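To make the container format on this slide concrete, here is a minimal sketch of how a generic BagIt bag could be assembled and zipped with the Python standard library. The function name `make_bag`, the choice of SHA-512 for the manifest, and placing the bag files at the root of the zip are assumptions made for illustration. The sketch does not write `bag-info.txt` or any of the tags the OCR-D profile may require, so the result is not claimed to pass validation against that profile.

```python
import hashlib
import zipfile
from pathlib import Path
from typing import Dict


def make_bag(bag_dir: Path, payload: Dict[str, bytes]) -> Path:
    """Write a minimal BagIt bag into bag_dir and return the path of the zipped bag.

    Hypothetical helper: payload maps relative file names to file contents.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # Bag declaration file required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    manifest_lines = []
    for name, content in sorted(payload.items()):
        target = data_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha512(content).hexdigest()
        # One manifest line per payload file: checksum, then path from the bag root.
        manifest_lines.append(f"{digest}  data/{name}")

    # Assumption: SHA-512 manifest; check the profile for the required algorithm.
    (bag_dir / "manifest-sha512.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )

    # Assumption: bag files sit at the root of the zip archive.
    zip_path = bag_dir.parent / (bag_dir.name + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(bag_dir.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(bag_dir).as_posix())
    return zip_path
```

Calling `make_bag(Path("my_bag"), {"mets.xml": mets_bytes})` would produce `my_bag.zip` next to the bag directory, with the payload under `data/`.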

  5. Metadata. METS: resource identifier (purl, urn, handle, url); header; Ground Truth (GT) metadata (optional); (workflow) provenance (optional). Provenance: model: PROV Data Model (PROV-DM / https://www.w3.org/TR/prov-dm/); format: PROV-XML (https://www.w3.org/TR/prov-xml/)
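As an illustration of reading such metadata, the following sketch pulls a title and a creation date out of a METS document with the Python standard library. It assumes the title is stored as a MODS `titleInfo/title` element inside the METS file and that the header carries a `CREATEDATE` attribute. The slides do not say where the repository actually looks for these fields, and the sample document and the function name `extract_metadata` are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Namespace URIs of the METS and MODS schemas.
NAMESPACES = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Invented sample document, used only to exercise the function below.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:mods="http://www.loc.gov/mods/v3">
  <mets:metsHdr CREATEDATE="2019-02-27T00:00:00"/>
  <mets:dmdSec ID="DMD1">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods>
          <mods:titleInfo><mods:title>Example title</mods:title></mods:titleInfo>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>"""


def extract_metadata(mets_xml: str) -> Dict[str, Optional[str]]:
    """Return title and creation date found in a METS document (None if absent)."""
    root = ET.fromstring(mets_xml)

    # Assumption: the title is a MODS title somewhere below the METS root.
    title_element = root.find(".//mods:titleInfo/mods:title", NAMESPACES)
    title = None
    if title_element is not None and title_element.text:
        title = title_element.text.strip()

    # Assumption: the creation date is the CREATEDATE attribute of metsHdr.
    header = root.find("mets:metsHdr", NAMESPACES)
    created = header.get("CREATEDATE") if header is not None else None

    return {"title": title, "created": created}
```

Fields that are missing come back as `None` rather than raising, which suits optional entries such as the GT metadata and provenance listed on the slide.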

  6. Architecture OCR-D Framework. [Diagram; label: Meta-/data]

  7. Ingest into the Research Data Repository
     Client:
       1. Upload the BagIt container using REST:
          curl -u ingest:GENERATED_PASSWORD -v -F "file=@zippedBagItContainer" http://localhost:8080/api/v1/metastore/bagit
     Server:
       2. Unzip container
       3. Validate container
       4. Extract metadata: METS header (title, ...), METS file, GT metadata (if available), PROV-XML (if available)
       5. Index metadata (Elasticsearch / Kibana)
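The client-side upload step can also be sketched in Python. The example below builds a multipart/form-data POST with HTTP basic authentication, which is roughly what the curl call with `-u` and `-F "file=@..."` sends. The endpoint URL, the `ingest` user, and the form field name `file` are taken from the slide; the function names `build_upload_request` and `upload` are hypothetical, and how the server responds is not covered by the slides.

```python
import base64
import urllib.request
import uuid
from pathlib import Path


def build_upload_request(
    url: str, zip_path: Path, user: str, password: str
) -> urllib.request.Request:
    """Build a multipart/form-data POST carrying the zipped bag in the 'file' field."""
    boundary = uuid.uuid4().hex
    body = b"".join(
        [
            f"--{boundary}\r\n".encode("ascii"),
            (
                'Content-Disposition: form-data; name="file"; '
                f'filename="{zip_path.name}"\r\n'
            ).encode("utf-8"),
            b"Content-Type: application/octet-stream\r\n\r\n",
            zip_path.read_bytes(),
            f"\r\n--{boundary}--\r\n".encode("ascii"),
        ]
    )
    # HTTP basic authentication, the counterpart of curl's -u user:password.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Authorization": f"Basic {token}",
        },
    )


def upload(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

Building the request separately from sending it keeps the multipart encoding inspectable without a server; `upload(build_upload_request("http://localhost:8080/api/v1/metastore/bagit", Path("bag.zip"), "ingest", "GENERATED_PASSWORD"))` would perform the actual ingest against a local instance.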

  8. Demo: OCR-D GT repository. https://hackmd.io/RplyN-srS1mnawLC3ngQMg#

  9. Summary. Research data repository supporting FAIR principles. Findable: data are described with metadata; meta-/data are registered or indexed. Accessible: uses an open protocol (REST); supports AAI if necessary; retrievable by identifier.

  10. Summary (continued). Interoperable: meta-/data use public formats (BagIt, METS, PAGE XML, TIFF, ...). Re-usable: (GT) meta-/data are released with a clear and accessible data usage license (CC BY-NC-SA 4.0 International); meta-/data are associated with their (workflow) provenance.

  11. Thank you! Questions?
