Standardization Process for Metadata Components in Language Archives

 
Metadata Component Framework
Possible Standardization Work within TC37/SC4
The Language Archive, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
 
Status of ISOCat Metadata Thematic Domain

Based on extensive discussions with experts (Athens, …)
Analysing existing metadata sets & inventories: OLAC, IMDI, ENABLER, …
Input from the CLARIN EU community, which also provided translations
Dedicated CLARIN NL metadata project

Currently 220 metadata data categories (DCs)
 
Further steps

Time to take stock of the situation
How to proceed with standardization?
Metadata Thematic Domain Group (TDG) chair proposal:
  Today, walk through the current set of metadata DCs and take note of all comments
  Start round two: contact all previous contributors (including those who were active in providing translations)
  After Christmas, the TDG will take charge again
ISOCat-supported standardization procedures are in place, but is this sufficient?
  Are the TDG members representative enough?
  DC ownership is not completely transparent
  Maybe these are just startup problems, or …
 
Standardization process

Create a “submission group”: a collection of DC owners
Create (one or more coherent) DC selection(s)
After submission, the TDG chair forms a decision group {chair, at least one other member}
The decision group decides on all submitted DCs
The decision is then forwarded to the DCR board
The DCR board has to validate the DCs and bring them to publication
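
The steps above can be read as a small state machine. The following is only a sketch under assumptions: the stage names, the `advance` and `form_decision_group` helpers, and the two "rejected" exits are illustrative choices, not part of any ISOCat tooling.

```python
from enum import Enum, auto


class Stage(Enum):
    SUBMITTED = auto()    # submission group has handed in a DC selection
    EVALUATION = auto()   # decision group (TDG chair + at least one member)
    VALIDATION = auto()   # DCR board
    PUBLISHED = auto()
    REJECTED = auto()


# Allowed moves between stages. Assumption: a submission can be
# rejected either by the decision group or by the DCR board.
TRANSITIONS = {
    Stage.SUBMITTED: {Stage.EVALUATION},
    Stage.EVALUATION: {Stage.VALIDATION, Stage.REJECTED},
    Stage.VALIDATION: {Stage.PUBLISHED, Stage.REJECTED},
    Stage.PUBLISHED: set(),
    Stage.REJECTED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


def form_decision_group(chair: str, members: list[str]) -> set[str]:
    """Decision group = the chair plus at least one other TDG member."""
    others = {m for m in members if m != chair}
    if not others:
        raise ValueError("decision group needs the chair and one other member")
    return {chair} | others
```

Skipping a stage, for example going straight from `SUBMITTED` to `PUBLISHED`, raises `ValueError`, which mirrors the requirement that the DCR board validates before publication.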
 
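DCR Decision process

[Figure: DCR decision process. Labels: Submission group, Thematic Domain Group, Decision Group, Stewardship group, Data Category Registry Board; stages Evaluation and Validation, each with a "rejected" exit, then Publication.]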
 
Standardization & Metadata Components

The assumption is that there is broad agreement about the limitations of existing metadata schemas: Dublin Core (DC)/OLAC, IMDI, TEI header
  Inflexible: too many (IMDI) or too few (OLAC) metadata elements
  Limited interoperability (both semantic and syntactic)
  Problematic (unfamiliar) terminology for some sub-communities
  Limited support for LT tool & service descriptions
Address this by:
  Explicitly defined schema & semantics
  User/project/community-defined components

Metadata Components

NOT a single new metadata schema, but rather allow the coexistence of many (community/researcher-)defined schemas with explicit semantics for interoperability

How does this work?
Components are bundles of related metadata elements that describe an aspect of the resource
A complete description of a resource may require several components
Components may contain other components
Components should be designed for reusability
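
As a rough illustration of "bundles of elements that may contain other components", the sketch below models a component as a small Python dataclass. The `Component` class, its `paths` method, and the top-level name `SpeechRecording` are hypothetical; the element names follow the speech-recording example used in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A named bundle of metadata elements and nested components."""
    name: str
    elements: list[str] = field(default_factory=list)
    components: list["Component"] = field(default_factory=list)

    def paths(self, prefix: str = "") -> list[str]:
        """List every element as a slash-separated path, depth first."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        found = [f"{here}/{e}" for e in self.elements]
        for child in self.components:
            found.extend(child.paths(here))
        return found


# The Language component is defined once and reused twice:
# for the recording itself and inside Actor.
language = Component("Language", ["Name", "Id"])
actor = Component("Actor", ["Name", "Age", "Sex"], [language])
technical = Component("TechnicalMetadata", ["SampleFrequency", "Format", "Size"])
recording = Component("SpeechRecording", components=[technical, language, actor])

for path in recording.paths():
    print(path)
```

Reusing the same `language` object in two places is the point of the design: the component is specified once and appears wherever a language needs describing.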
 
 
Metadata Components

Let's describe a speech recording. The slides build the description up one component at a time:

Technical Metadata: Sample frequency, Format, Size
Language: Name, Id
Actor: Name, Age, Sex, Language
Location: Continent, Country, Address
Project: Name, Contact

[Figure: the Technical Metadata, Language, Actor, Location and Project components combined into a metadata profile. Labels: Component definition (XML), Profile definition (XML), Metadata profile, Metadata schema (W3C XML Schema), Metadata description (XML file).]

Recursive Component model

Components can contain other components
Enhances reusability
[Figure: nested components. Labels: Project, Location, Address, Actor, ActorLanguage.]

Reusability & Explicit Semantics

Selecting metadata components & profiles from the registry
The user selects appropriate components from the component registry to create a new metadata profile, or selects an existing profile
Semantic interoperability is partly solved via references to the ISO DCR or another registry
ISOCat, or ISO DCR: implementation of the ISO 12620 standard for data categories, under control of the linguistic community (ISO TC37)
Metadata is just one of the seven “thematic domains”
[Figure: a user selects from the component registry; elements point to concept registries. Component registry labels: Location, Country, Coordinates, Text, Language, Title, Actor, BirthDate, MotherTongue, Recording, CreationDate, Type, Dance, Name, Type. ISOcat concept registry: Country dcr:1001, Language dcr:1002, BirthDate dcr:1000. DCMI concept registry: Title: dc:title.]
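
To make the "references to a concept registry" idea concrete, here is a minimal sketch. It assumes two hypothetical profiles (`ARCHIVE_A`, `ARCHIVE_B`) with invented local field names; only the concept identifiers `dcr:1000`, `dcr:1001`, `dcr:1002` and `dc:title` are taken from the slide. Fields are matched on their concept link, not on their name.

```python
# Two hypothetical profiles: local field name -> concept link.
ARCHIVE_A = {"Country": "dcr:1001", "Language": "dcr:1002", "Title": "dc:title"}
ARCHIVE_B = {"Land": "dcr:1001", "Taal": "dcr:1002", "DateOfBirth": "dcr:1000"}


def shared_concepts(
    a: dict[str, str], b: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Map each concept used by both profiles to the pair of local names."""
    by_concept_b = {concept: name for name, concept in b.items()}
    return {
        concept: (name, by_concept_b[concept])
        for name, concept in a.items()
        if concept in by_concept_b
    }


print(shared_concepts(ARCHIVE_A, ARCHIVE_B))
```

The two profiles share no field names, yet a search tool could still align `Country` with `Land` because both point at the same data category. Fields without a counterpart (`Title`, `DateOfBirth`) stay unmatched, which is one reason interoperability is only "partly" solved this way.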
 
How to proceed?

Decouple separate issues?
Standardization of metadata DCs in the ISO DCR
------------------------------------------------------------------
Defining requirements for a Metadata Component Model
Standardizing the Component Model itself
Standardizing a Component Specification Language
Designing/specifying a number of recommended components for specific data types and usages

Of course, building on existing or continuing work such as ISOCat, PISA, … where possible
 
Requirements for the component model

For example:
A component has attributes: name, multiplicity, concept link, …
The component model should support recursion
A component contains a number of metadata elements
A metadata element has a name, value scheme, multiplicity, and concept link
A component can refer to a number of resources or to other metadata components
A component can contain information about resource relations
A component grammar has to be fully deterministic to avoid ambiguity
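
One way to see what the element attributes buy is to check a description against them. The sketch below is an assumption-laden illustration, not the CMDI implementation: `ElementSpec`, `check`, and the `ACTOR` specification are hypothetical names, and only multiplicity is enforced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElementSpec:
    """Element attributes named in the requirements list."""
    name: str
    min_occurs: int = 1
    max_occurs: Optional[int] = 1      # None means unbounded
    value_scheme: str = "string"
    concept_link: Optional[str] = None


def check(spec: list[ElementSpec], record: dict[str, list[str]]) -> list[str]:
    """Return multiplicity violations for one component instance."""
    problems = []
    known = {e.name for e in spec}
    for e in spec:
        n = len(record.get(e.name, []))
        if n < e.min_occurs:
            problems.append(f"{e.name}: expected at least {e.min_occurs}, found {n}")
        if e.max_occurs is not None and n > e.max_occurs:
            problems.append(f"{e.name}: expected at most {e.max_occurs}, found {n}")
    for name in record:
        if name not in known:
            problems.append(f"{name}: not declared in the component")
    return problems


# Hypothetical Actor component: one Name, optional Age, one or more Languages.
ACTOR = [
    ElementSpec("Name"),
    ElementSpec("Age", min_occurs=0),
    ElementSpec("Language", max_occurs=None, concept_link="dcr:1002"),
]

print(check(ACTOR, {"Name": ["A", "B"]}))
```

Because every element carries explicit multiplicity, the same description is always either accepted or reported with the same violations.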
 
 
Metadata Component Model
 
Should embody the requirements without defining a component specification language
As an example: the CLARIN CMDI component model
 
Component Specification Language

<CMD_ComponentSpec isProfile="false">
  <Header>
    <ID>clarin.eu:cr1:c_1271859438108</ID>
    <Name>iso-639-5</Name>
    <Description>The list of ISO-639-5 language families. Based on: http://en.wikipedia.org/wiki/List_of_ISO_639-5_codes</Description>
  </Header>
  <CMD_Component name="ISO635">
    <CMD_Element CardinalityMax="unbounded" CardinalityMin="1" name="iso-639-5-code">
      <ValueScheme>
        <enumeration>
          <item AppInfo="Austro-Asiatic languages" ConceptLink="http://cdb.iso.org/lg/CDB-00138763-001">aav</item>
          <item AppInfo="Afro-Asiatic languages" ConceptLink="http://cdb.iso.org/lg/CDB-00138759-001">afa</item>
          <item AppInfo="Algonquian languages" ConceptLink="http://cdb.iso.org/lg/CDB-00138721-001">alg</item>
          <item AppInfo="Atlantic-Congo languages" ConceptLink="http://cdb.iso.org/lg/CDB-00138719-001">alv</item>
          <!-- [...] remaining items elided in the original -->
        </enumeration>
      </ValueScheme>
    </CMD_Element>
  </CMD_Component>
</CMD_ComponentSpec>

CLARIN Component example: ISO-639-5 component
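
A component specification like this is ordinary XML, so a tool can read its value scheme with a standard parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the example (two items, header trimmed). The `enumeration_items` helper is hypothetical and is not part of any CLARIN software.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the component specification shown above.
SPEC = """\
<CMD_ComponentSpec isProfile="false">
  <Header>
    <Name>iso-639-5</Name>
  </Header>
  <CMD_Component name="ISO635">
    <CMD_Element CardinalityMax="unbounded" CardinalityMin="1" name="iso-639-5-code">
      <ValueScheme>
        <enumeration>
          <item AppInfo="Austro-Asiatic languages" ConceptLink="http://cdb.iso.org/lg/CDB-00138763-001">aav</item>
          <item AppInfo="Afro-Asiatic languages" ConceptLink="http://cdb.iso.org/lg/CDB-00138759-001">afa</item>
        </enumeration>
      </ValueScheme>
    </CMD_Element>
  </CMD_Component>
</CMD_ComponentSpec>
"""


def enumeration_items(spec_xml: str) -> dict[str, dict[str, str]]:
    """Collect each enumerated value with its label and concept link."""
    root = ET.fromstring(spec_xml)
    items = {}
    for element in root.iter("CMD_Element"):
        for item in element.iter("item"):
            items[item.text.strip()] = {
                "element": element.get("name"),
                "label": item.get("AppInfo"),
                "concept": item.get("ConceptLink"),
            }
    return items


for code, info in enumeration_items(SPEC).items():
    print(code, info["label"], info["concept"])
```

Each code keeps its `ConceptLink`, so the explicit semantics travel with the value and are not left to documentation.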
 
ISO Recommended Components
 
Our CMDI experience is that we need to limit the proliferation of components
Offer a set of standardized ones for use with specific data types and for specific purposes
 
Reference implementation
 
CLARIN has been working on:
Metadata component registry and editor
Metadata editor
If a potential ISO standard for the component model and specification is not too different from CLARIN requirements and practice, these could serve as a reference implementation
 
Persistent Identifier Use
 
Everything is based on the recent FDIS 24619 (PISA)
Cool URIs for the concept links to ISOCat and ISOCDB
All references to resources and metadata can contain PIDs
 
Thank you for your attention

The standardization process for metadata components within TC37/SC4 at the Language Archive Max Planck Institute involves analyzing existing metadata sets, seeking input from the CLARIN EU community, and determining the next steps for standardization. The process includes forming submission groups, selecting data categories, making decisions on submitted components, and validating for publication. The aim is to address limitations in existing metadata schemas and enhance support for linguistic tool and service descriptions.


Uploaded on Aug 17, 2024 | 0 Views




