Managing Research Data Repositories for OCR-D

 
Repositories for OCR-D
Volker Hartmann
volker.hartmann@kit.edu
 
 
27.02.2019
 
Repository

Repository
A central location in which data is stored and managed.

Software Repository
GitHub
GitLab
Data Repository
 
 
Glossary OCR-D
 
Research data repository
The research data repository may contain the results of all steps during document
analysis. At least it contains the end results of every processed document and its full
provenance. The research data repository does not need to be publicly available.
Ground Truth (GT) data repository
Research data repository that is publicly available. It contains all the
ground truth meta-/data. (Available at: https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit)
 
 
Technical Data
 
 
Supported Format
Container format: BagIt version 0.97+
Profile: https://ocr-d.github.io/bagit-profile.json
 
Protocol
HTTP (REST)
 
Link: https://github.com/OCR-D/repository_metastore
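To make the container format more concrete, here is a minimal Python sketch of a structural check on a zipped bag. It is not the repository's validation code. The function name `check_bag` is hypothetical, and the sketch assumes that `bagit.txt` sits at the root of the ZIP and that a `manifest-sha512.txt` payload manifest is present; the slide itself names only BagIt 0.97+ and the profile URL. The sketch does not evaluate the OCR-D profile.

```python
import hashlib
import io
import zipfile


def check_bag(zip_bytes: bytes) -> list[str]:
    """Return a list of problems found in a zipped bag (empty list: none found)."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = set(zf.namelist())
        # Assumption: the bag root is the root of the ZIP archive.
        if "bagit.txt" not in names:
            return ["bagit.txt is missing"]
        declaration = zf.read("bagit.txt").decode("utf-8")
        if not declaration.startswith("BagIt-Version:"):
            problems.append("bagit.txt does not start with BagIt-Version")
        # Assumption: payload checksums are listed in a SHA-512 manifest.
        if "manifest-sha512.txt" not in names:
            return problems + ["manifest-sha512.txt is missing"]
        manifest = zf.read("manifest-sha512.txt").decode("utf-8")
        for line in manifest.splitlines():
            if not line.strip():
                continue
            # Each manifest line has the form "<checksum> <path>".
            expected, _, path = line.strip().partition(" ")
            path = path.strip()
            if not path:
                problems.append(f"malformed manifest line: {line!r}")
            elif path not in names:
                problems.append(f"payload file missing: {path}")
            elif hashlib.sha512(zf.read(path)).hexdigest() != expected.lower():
                problems.append(f"checksum mismatch: {path}")
    return problems
```

Calling `check_bag(open("bag.zip", "rb").read())` on a local file (the file name is a placeholder) returns an empty list when the declaration and all listed checksums match.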
 
 
Metadata
 
METS:
Resource Identifier (purl, urn, handle, url)
Header
Ground Truth (GT) Metadata (optional)
(Workflow-)Provenance (optional)
 
Provenance:
Model: PROV Data Model (PROV-DM, https://www.w3.org/TR/prov-dm/)
Format: PROV-XML (https://www.w3.org/TR/prov-xml/)
 
 
Architecture OCR-D Framework
 
[Figure: architecture of the OCR-D framework, labelled "Meta-/data"]
 
 
Ingest into the Research Data Repository
 
1. Upload BagIt container using REST
   curl -u ingest:GENERATED_PASSWORD -v -F "file=@zippedBagItContainer" http://localhost:8080/api/v1/metastore/bagit
2. Unzip container
3. Validate container
4. Extract metadata
   1. METS Header (title, …)
   2. METS file
   3. GT metadata (if available)
   4. PROV XML (if available)
5. Index metadata (Elasticsearch / Kibana)
 
Step 1 runs on the client; steps 2 to 5 run on the server.
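The upload in step 1 can also be scripted. The following Python sketch prepares the same kind of request that `curl -u ... -F "file=@..."` sends: a multipart/form-data POST with HTTP Basic authentication. The helper names `build_multipart`, `build_ingest_request` and `ingest` are hypothetical; the URL, user name and password are the placeholders from the slide, and the server's response format is not assumed.

```python
import base64
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def build_ingest_request(url: str, user: str, password: str,
                         filename: str, payload: bytes) -> urllib.request.Request:
    """Prepare the POST request; the form field is called "file" as in the curl call."""
    body, content_type = build_multipart("file", filename, payload)
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": content_type, "Authorization": f"Basic {token}"},
    )


def ingest(path: str, url: str, user: str, password: str) -> int:
    """Send a zipped bag to the server and return the HTTP status code."""
    with open(path, "rb") as fh:
        request = build_ingest_request(url, user, password, path, fh.read())
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a server running locally, `ingest("zippedBagItContainer", "http://localhost:8080/api/v1/metastore/bagit", "ingest", "GENERATED_PASSWORD")` would perform the upload. Building the request separately from sending it keeps the encoding testable without a server.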
 
 
Demo OCR-D-GT-Repository
 
 
 
 
 
https://hackmd.io/RplyN-srS1mnawLC3ngQMg#
 
 
 
List URLs of all BagIt containers
> curl -X GET https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit
[ "https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/9c06dd3b-e921-4264-aef9-ab2e9d276328/data/SBB0000F23300010000.zip", … ]
 
 
List content of all METS files
> curl -X GET https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets
[{ "id": "296848",
  "resourceId": "9c06dd3b-e921-4264-aef9-ab2e9d276328",
  "metsContent": "<mets:mets xmlns:mets=\"http://www.loc.gov/METS/\" … </mets:mets>" },
]
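The two listings above can be processed with a few lines of Python. This sketch is an illustration only: the function names are hypothetical, the field names `resourceId` and `metsContent` are taken from the example response, and the position of the resource identifier directly after `dataresources` in the container URL is inferred from the example URL.

```python
import json
import urllib.request
from urllib.parse import urlparse

BASE_URL = "https://ocr-d-repo.scc.kit.edu/api/v1/metastore"


def fetch_json(url: str):
    """GET a URL and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def resource_id_from_url(bag_url: str) -> str:
    """Return the path segment that follows 'dataresources' in a container URL."""
    parts = urlparse(bag_url).path.split("/")
    return parts[parts.index("dataresources") + 1]


def mets_by_resource(records: list[dict]) -> dict[str, str]:
    """Map each resource identifier to the METS document stored for it."""
    return {record["resourceId"]: record["metsContent"] for record in records}
```

For example, `[resource_id_from_url(u) for u in fetch_json(BASE_URL + "/bagit")]` would collect the identifiers of all containers, provided the service answers as shown on the slide.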
 
(REST-)Examples for Accessing the Research Data Repository
 
 
List all resource identifiers for the given title "Der Herold"
> curl -X GET "https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/title?title=Der%20Herold"
[9c06dd3b-e921-4264-1ef9-ab2e9d276328]
 
 
List content of METS files for the given resource identifier
> curl -X GET https://ocr-d-repo.scc.kit.edu/api/v1/metastore/mets/9c06dd3b-e921-4264-aef9-ab2e9d276328
{ "id": "296848",
  "resourceId": "9c06dd3b-e921-4264-aef9-ab2e9d276328",
  "metsContent": "<mets:mets xmlns:mets=\"http://www.loc.gov/METS/\" … </mets:mets>" }
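The title in the query string must be percent-encoded, which is why the space in "Der Herold" appears as `%20`. A small Python sketch of how the two lookup URLs could be built; the helper names are hypothetical, and the URL layout is copied from the curl examples.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://ocr-d-repo.scc.kit.edu/api/v1/metastore"


def title_query_url(title: str) -> str:
    """Build the title lookup URL; quote_via=quote encodes a space as %20, not '+'."""
    return f"{BASE_URL}/mets/title?" + urlencode({"title": title}, quote_via=quote)


def mets_url(resource_id: str) -> str:
    """Build the URL of the METS record for one resource identifier."""
    return f"{BASE_URL}/mets/{quote(resource_id, safe='')}"


print(title_query_url("Der Herold"))
print(mets_url("9c06dd3b-e921-4264-aef9-ab2e9d276328"))
```

The first line printed is the URL used in the title example; non-ASCII characters such as umlauts are encoded as UTF-8 bytes in the same way.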
 
Download BagIt container from repository
> curl -X GET https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/9c06dd3b-e921-4264-aef9-ab2e9d276328/data/SBB0000F29300010000.zip > SBB0000F29300010000.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11.7M  100 11.7M    0     0  46.0M      0 --:--:-- --:--:-- --:--:-- 46.1M
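The same download can be done from Python. This is a sketch under the assumption that the container URL is reachable without authentication, as in the curl call; the function names are hypothetical. The response is streamed to disk in chunks, so a large container does not have to fit into memory.

```python
import shutil
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Use the last path segment of the URL as the local file name."""
    return PurePosixPath(urlparse(url).path).name


def download(url: str, target_dir: str = ".") -> Path:
    """Stream the resource behind `url` into `target_dir` and return the local path."""
    target = Path(target_dir) / filename_from_url(url)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Passing the container URL from the example above to `download` would write `SBB0000F29300010000.zip` to the current directory.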
 
 
 
Summary
 
Research Data Repository supporting FAIR principles
Findable
Data described with metadata
Meta-/data are registered or indexed
Accessible
Use open protocol (REST)
Supports AAI if necessary
Retrievable by identifier
 
 
Interoperable
Meta-/data use public formats (BagIt, METS, PAGE XML, TIFF, …)
 
Re-usable
(GT) Meta-/data are released with a clear and accessible data usage
license (CC BY-NC-SA 4.0 International)
Meta-/data are associated with their (workflow-)provenance
 
 
 
Thank you!
 
Questions?

Research data repositories play a crucial role in the OCR-D framework, storing and managing data from document analysis processes. These repositories, like the Ground Truth (GT) repository, support FAIR principles by organizing findable, accessible, and retrievable data with metadata and provenance information. Ingestion processes involve uploading BagIt containers, extracting metadata, and indexing content for efficient retrieval. The framework also includes supported formats and container profiles to ensure data integrity and interoperability.

  • Research data
  • OCR-D framework
  • Data repositories
  • Ground Truth
  • FAIR principles

Uploaded on Oct 05, 2024



