Managing Research Data Repositories for OCR-D
Research data repositories play a central role in the OCR-D framework: they store and manage the data produced during document analysis. These repositories, including the public Ground Truth (GT) repository, support the FAIR principles by making data findable, accessible, interoperable, and re-usable, with metadata and provenance information attached. Ingest consists of uploading a BagIt container, validating it, extracting metadata, and indexing that metadata for retrieval. The framework specifies a supported container format and profile to ensure data integrity and interoperability.
Presentation Transcript
Repositories for OCR-D Volker Hartmann volker.hartmann@kit.edu 27.02.2019
Repository
A central location in which data is stored and managed.
  Software repository (e.g. GitHub, GitLab)
  Data repository
Glossary OCR-D
Research data repository: May contain the results of all steps of document analysis. At a minimum, it contains the end results of every processed document together with their full provenance. It does not need to be publicly available.
Ground Truth (GT) data repository: A research data repository that is publicly available. It contains all the ground truth meta-/data. (Available at: https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit)
Technical Data
Supported format
  Container format: BagIt, version 0.97+
  Profile: https://ocr-d.github.io/bagit-profile.json
Protocol
  HTTP (REST)
Link: https://github.com/OCR-D/repository_metastore
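To make the container format concrete, here is a minimal sketch of how a client might assemble and zip a generic BagIt 0.97 bag using only the Python standard library. The function name `make_bag`, the choice of SHA-512 for the manifest, and placing the bag contents at the root of the ZIP are assumptions for illustration. The OCR-D profile linked above may require additional tag files and fields that are not shown here.

```python
import hashlib
import zipfile
from pathlib import Path
from typing import Dict


def make_bag(bag_dir: Path, payload: Dict[str, bytes]) -> Path:
    """Write a minimal BagIt bag into bag_dir and return the path of its ZIP.

    Hypothetical helper: it produces a generic bag and does not claim
    conformance with the OCR-D BagIt profile.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        target = data_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        # One manifest line per payload file: "<checksum>  <relative path>"
        digest = hashlib.sha512(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")

    # Bag declaration required by the BagIt specification
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha512.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )

    # The ZIP is written next to the bag directory so it is not packed into itself
    zip_path = bag_dir.parent / (bag_dir.name + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(bag_dir.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(bag_dir).as_posix())
    return zip_path
```

A call such as `make_bag(Path("work/bag"), {"mets.xml": mets_bytes})` would yield `work/bag.zip`, which could then be passed to an upload request.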
Metadata
METS:
  Resource identifier (purl, urn, handle, url)
  Header
  Ground Truth (GT) metadata (optional)
  (Workflow) provenance (optional)
Provenance:
  Model: PROV Data Model (PROV-DM, https://www.w3.org/TR/prov-dm/)
  Format: PROV-XML (https://www.w3.org/TR/prov-xml/)
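As an illustration of the provenance format, the following sketch parses a small PROV-XML document and lists which activity generated which entity. The sample document, the `ex:` identifiers, and the function name `generated_by` are hypothetical and are not taken from the OCR-D repository. Only the PROV namespace and the `wasGeneratedBy` relation come from the W3C PROV-XML note.

```python
import xml.etree.ElementTree as ET

PROV = "http://www.w3.org/ns/prov#"

# Hypothetical sample: one OCR result entity generated by one workflow step
SAMPLE = """<prov:document xmlns:prov="http://www.w3.org/ns/prov#"
               xmlns:ex="http://example.org/ns#">
  <prov:entity prov:id="ex:page-0001-ocr"/>
  <prov:activity prov:id="ex:ocr-step"/>
  <prov:wasGeneratedBy>
    <prov:entity prov:ref="ex:page-0001-ocr"/>
    <prov:activity prov:ref="ex:ocr-step"/>
  </prov:wasGeneratedBy>
</prov:document>"""


def generated_by(prov_xml: str) -> dict:
    """Map each entity identifier to the activity that generated it."""
    root = ET.fromstring(prov_xml)
    result = {}
    for relation in root.findall(f"{{{PROV}}}wasGeneratedBy"):
        entity = relation.find(f"{{{PROV}}}entity")
        activity = relation.find(f"{{{PROV}}}activity")
        if entity is not None and activity is not None:
            # prov:ref is a namespaced attribute pointing at a prov:id
            result[entity.get(f"{{{PROV}}}ref")] = activity.get(f"{{{PROV}}}ref")
    return result


print(generated_by(SAMPLE))
```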
Architecture OCR-D Framework
[Figure: architecture diagram of the OCR-D framework; surviving label: "Meta-/Daten" (meta-/data)]
Ingest into the Research Data Repository
Client:
  1. Upload BagIt container using REST:
     curl -u ingest:GENERATED_PASSWORD -v -F "file=@zippedBagItContainer" http://localhost:8080/api/v1/metastore/bagit
Server:
  2. Unzip container
  3. Validate container
  4. Extract metadata
     1. METS header (title, …)
     2. METS file
     3. GT metadata (if available)
     4. PROV XML (if available)
  5. Index metadata (Elasticsearch / Kibana)
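The server-side steps 2 to 4 can be sketched roughly as follows. This is not the repository's actual implementation: the function names, the SHA-512 manifest, the location `data/mets.xml`, and reading the title from a MODS `title` element are all assumptions made for illustration. Step 5 (indexing) is not shown.

```python
import hashlib
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List, Optional

# Namespace prefixes used for the (assumed) MODS title lookup inside METS
NS = {"mets": "http://www.loc.gov/METS/", "mods": "http://www.loc.gov/mods/v3"}


def unzip_container(zip_path: Path, target: Path) -> Path:
    """Step 2: unpack the uploaded container into target."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


def validate_container(bag_dir: Path) -> List[str]:
    """Step 3: return a list of problems; an empty list means the checks passed."""
    problems = []
    if not (bag_dir / "bagit.txt").is_file():
        problems.append("missing bagit.txt")
    manifest = bag_dir / "manifest-sha512.txt"
    if not manifest.is_file():
        problems.append("missing manifest-sha512.txt")
        return problems
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        payload = bag_dir / rel_path
        if not payload.is_file():
            problems.append(f"missing payload file: {rel_path}")
        elif hashlib.sha512(payload.read_bytes()).hexdigest() != expected.lower():
            problems.append(f"checksum mismatch: {rel_path}")
    return problems


def extract_title(mets_path: Path) -> Optional[str]:
    """Step 4 (partial): read the first MODS title found in the METS file."""
    root = ET.parse(mets_path).getroot()
    title = root.find(".//mods:title", NS)
    return title.text if title is not None else None
```

Returning a list of problems rather than raising on the first one lets a server report everything wrong with a container in a single response.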
Demo OCR-D-GT-Repository
https://hackmd.io/RplyN-srS1mnawLC3ngQMg#
Summary
Research Data Repository supporting FAIR principles
Findable
  Data are described with metadata
  Meta-/data are registered or indexed
Accessible
  Uses an open protocol (REST)
  Supports AAI if necessary
  Retrievable by identifier
Summary
Interoperable
  Meta-/data use public formats (BagIt, METS, PAGE XML, TIFF, …)
Re-usable
  (GT) meta-/data are released with a clear and accessible data usage license (CC BY-NC-SA 4.0 International)
  Meta-/data are associated with their (workflow) provenance
Thank you! Questions?