Workshop on EOSC Service Provider Challenges and Solutions in Pisa

 
 
Kathrin Beck (MPG/MPCDF)
 
 
EOSC Service Provider Workshop
2017/09/13, Pisa
 
Introduction & aims of TextCrowd
Progress report and current status
Challenges
Further work
Success criteria
 
13/09/2017
 
EOSCpilot workshop, Pisa
 
2
 
Franco Niccolucci, Achille Felicetti (PIN / Florence Univ.)
Kathrin Beck, Thomas Zastrow (MPG/MPCDF; shepherds)
 
 
 
Fragmented research landscape in the Social Sciences and Humanities
EOSC: structure and integrate initiatives such as CLARIN, DARIAH and E-RIHS ERICs, and Digital Humanities Organizations (e.g. their Association ADHO)
Offer advanced text-based services addressing common research needs
Benefit for many scientists in the long tail, even if delivering such a service presents real challenges around interoperability and multilingualism
 
 
Cultural heritage and humanities datasets are largely based on texts:
  Reports
    Archaeology: excavations, surveys
    Conservation: diagnosis, restoration – often mixed with numeric results
  Grey literature
  Literary/historical sources
  Research articles
  Monographs
Aim:
  Linguistically annotated texts
  Machine learning models for natural language processing (NLP) tool chains
  Automatic annotation and information extraction via NLP tools
Size: the demonstrator will work with data in the range of megabytes, later extensible up to 2 million files
 
 
Named Entity (NE) categories:
  Artefact
  Colour
  Material
  Time period
  Person
  Place
  Site
  Timespan
  Technique
Target output formats / ontologies:
  RDF (Resource Description Framework, by W3C)
  CIDOC CRM (ICOM's International Committee for Documentation Conceptual Reference Model, by the World Museum Community)
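
As an illustration of the target output, the sketch below turns one recognised entity into RDF statements (N-Triples) typed with a CIDOC CRM class. The category-to-class mapping, the example.org URI and the function names are assumptions made for this example; the slides do not state which CRM classes TextCrowd actually uses.

```python
# Minimal sketch: serialise one named entity as N-Triples typed with CIDOC CRM.
# ASSUMPTION: the mapping below is illustrative only, not TextCrowd's mapping.
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

CATEGORY_TO_CRM = {
    "Artefact": "E22_Man-Made_Object",
    "Material": "E57_Material",
    "Time period": "E4_Period",
    "Person": "E21_Person",
    "Place": "E53_Place",
    "Site": "E27_Site",
    "Timespan": "E52_Time-Span",
    # Colour and Technique have no dedicated class here; E55_Type is a guess.
    "Colour": "E55_Type",
    "Technique": "E55_Type",
}


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def entity_to_ntriples(entity_uri, category, label, lang="it"):
    """Return two N-Triples lines: the CRM type and a language-tagged label."""
    crm_class = CATEGORY_TO_CRM[category]
    return [
        f"<{entity_uri}> <{RDF_TYPE}> <{CRM}{crm_class}> .",
        f'<{entity_uri}> <{RDFS_LABEL}> "{escape_literal(label)}"@{lang} .',
    ]


if __name__ == "__main__":
    # Hypothetical entity URI, used only for demonstration.
    for line in entity_to_ntriples("http://example.org/entity/1", "Place", "Pisa"):
        print(line)
```

A real pipeline would more likely use an RDF library and mint stable entity URIs; plain string formatting is used here only to keep the example dependency-free.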
 
GATE toolchain (https://gate.ac.uk/)
GATE pipeline has been refined and further developed:
  importing vocabularies and some pre-processing of their content
  replacing the Italian OpenNLP with FP7 project OpeNER components via web service calls from GATE, with resulting improvement in NER discovery
    OpeNER: neural network instead of OpenNLP maximum entropy model
  checking OpeNER outcomes
  refining stemming/lemmatization components
  developing part-of-speech (POS) rules for filtering on nouns when annotating
  specialised timespan and period component with pattern-based rules
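
The last two refinements can be illustrated with a small sketch. In GATE, rules like these would typically be written as JAPE grammars; the Python below is only an approximation of the idea. The regular expressions, the tag names (`NOUN`, `PROPN`) and the function names are assumptions for this example and are not taken from the TextCrowd pipeline.

```python
import re

# ASSUMPTION: two illustrative patterns for Italian timespans, e.g.
# "III secolo a.C." (century) and "1200-1250 d.C." (year range).
ERA = r"(?:\s+(?:a\.C\.|d\.C\.))?"
CENTURY = re.compile(r"\b[IVXLC]+\s+secolo" + ERA)
YEAR_RANGE = re.compile(r"\b\d{3,4}\s*-\s*\d{3,4}" + ERA)


def find_timespans(text):
    """Return (start, end, matched_text) for every pattern hit, in text order."""
    spans = []
    for pattern in (CENTURY, YEAR_RANGE):
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), match.group(0)))
    return sorted(spans)


def keep_nouns(tagged_tokens):
    """Keep only tokens tagged as nouns; input is a list of (word, pos) pairs.

    ASSUMPTION: tags follow the Universal POS tag names NOUN and PROPN.
    """
    return [word for word, pos in tagged_tokens if pos in {"NOUN", "PROPN"}]


if __name__ == "__main__":
    sentence = "Ceramica del III secolo a.C. e monete del 1200-1250 d.C."
    for start, end, span in find_timespans(sentence):
        print(start, end, span)
    print(keep_nouns([("la", "DET"), ("ceramica", "NOUN"), ("romana", "ADJ")]))
```

Filtering on nouns before vocabulary lookup reduces false matches on verbs and adjectives that happen to share a surface form with a vocabulary term.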
 
 
Operated and maintained by CNR-ISTI on the D4Science VRE (https://www.d4science.org/)
Workflow engine with GATE pipeline, operated as REST-style web services (running in Sheffield)
Intuitive, web-based user interface
User management
Storage (private and shared files)
 
 
No annotated text corpora available as training data for machine learning algorithms
  Manual annotation of 400 pages of Italian archaeology reports in progress (current status: 200 pages annotated)
No user-friendly Cloud-based environment available
  Desktop GATE pipeline migrated into D4Science
AAI (authentication and authorisation infrastructure) issues
  The pilot focuses on freely available texts
  User management within D4Science
 
 
Finishing the manual annotation of training text corpora
Training of machine-learning-based NE applications
Integration of the improved OpeNER recogniser into the D4Science GATE pipeline
 
 
Creation of a text corpus for annotation
  When is it big enough? When is the output good enough, and which text types are most relevant?
  Current focus: reasonable quality for the most common text types in contemporary Italian
Interoperability of tool pipeline
  GATE offers all necessary tools, except for Italian NER; interoperability provided by GATE developers
  Interoperability between TextCrowd's toolchain and other SSH workflow systems like WebLicht (https://weblicht.sfs.uni-tuebingen.de/): not a focus now
Performance enhancements
User-friendly Cloud-based named entity recognition (NER) workflow for Italian archaeologists
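
One common way to make "good enough" measurable is to compare the pipeline's output against the manually annotated pages using precision, recall and F1. The slides do not say which metric the pilot uses, so the sketch below is an assumption; it applies exact-match scoring, where an entity counts as correct only if start offset, end offset and category all agree.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entities against gold annotations by exact match.

    Each entity is a (start, end, category) tuple.
    ASSUMPTION: exact span and category match; partial overlaps count as errors.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up offsets and labels, for demonstration only.
    gold = {(0, 4, "Place"), (10, 18, "Material"), (25, 40, "Timespan")}
    predicted = {(0, 4, "Place"), (10, 18, "Artefact"),
                 (25, 40, "Timespan"), (50, 55, "Person")}
    print(precision_recall_f1(gold, predicted))
```

Tracking these scores per text type as the annotated corpus grows gives one way to judge when additional annotation stops improving the output.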
 
 
Thank you for your attention!
 
Any questions?
 
 

Workshop held in Pisa addressing challenges and solutions for EOSC service providers in the realm of Social Sciences, Humanities, and Cultural Heritage. Discussions revolved around fragmented research landscapes, text-based datasets, linguistic annotations, and machine learning models for NLP tools. Named Entity categories, such as artefact, colour, material, time period, person, place, etc., were explored along with output formats like RDF and CIDOC CRM. The event aimed to integrate initiatives like CLARIN, DARIAH, E-RIHS, and Digital Humanities Organizations to benefit a wide range of scientists.

  • EOSC
  • Service Provider
  • Challenges
  • Solutions
  • Humanities

Uploaded on Sep 25, 2024



