Workshop on EOSC Service Provider Challenges and Solutions in Pisa

Slide Note

This presentation was given at a workshop in Pisa on challenges and solutions for EOSC service providers in the Social Sciences, Humanities, and Cultural Heritage. It covers the fragmented research landscape, text-based datasets, linguistic annotation, and machine learning models for NLP tools. Named Entity categories such as artefact, colour, material, time period, person, and place are presented, along with the target output formats and ontologies RDF and CIDOC CRM. The work aims to integrate initiatives like CLARIN, DARIAH, E-RIHS, and Digital Humanities organizations to benefit a wide range of scientists.

Uploaded on Sep 25, 2024

Presentation Transcript

Kathrin Beck (MPG/MPCDF) EOSC Service Provider Workshop 2017/09/13, Pisa

- Introduction & aims of TextCrowd
- Progress report and current status
- Challenges
- Further work
- Success criteria

EOSCpilot workshop, Pisa, 13/09/2017

- Franco Niccolucci, Achille Felicetti (PIN / Florence Univ.)
- Kathrin Beck, Thomas Zastrow (MPG/MPCDF; shepherds)

- Fragmented research landscape in the Social Sciences and Humanities
- EOSC: structure and integrate initiatives such as the CLARIN, DARIAH and E-RIHS ERICs, and Digital Humanities organizations (e.g. their association ADHO)
- Offer advanced text-based services addressing common research needs
- Benefit for many scientists in the long tail, even if delivering such a service presents real challenges around interoperability and multilingualism

Cultural heritage and humanities datasets are largely based on texts:
- Reports
  - Archaeology: excavations, surveys
  - Conservation: diagnosis, restoration
  - often mixed with numeric results
- Grey literature
- Literary/historical sources
- Research articles
- Monographs

Aim:
- Linguistically annotated texts
- Machine learning models for natural language processing (NLP) tool chains
- Automatic annotation and information extraction via NLP tools

Size: the demonstrator will work with data in the range of megabytes, later extensible up to 2 million files

Named Entity (NE) categories:
- Artefact
- Colour
- Material
- Time period
- Person
- Place
- Site
- Timespan
- Technique

Target output formats / ontologies:
- RDF (Resource Description Framework, by W3C)
- CIDOC CRM (ICOM's International Committee for Documentation Conceptual Reference Model, by World Museum Community)
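
To illustrate how a recognised entity from the categories above might be expressed in RDF using CIDOC CRM classes, here is a minimal sketch in Python that emits N-Triples. The category-to-class mapping, the example entity URI, and the function names are assumptions made for illustration; the slides do not specify TextCrowd's actual mapping, and only a subset of the categories is covered.

```python
# Sketch only: serialise a recognised named entity as RDF (N-Triples).
# The mapping below is a hypothetical choice of CIDOC CRM classes,
# not the mapping used by TextCrowd.
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

NE_TO_CRM = {
    "Person": "E21_Person",
    "Place": "E53_Place",
    "Site": "E27_Site",
    "Material": "E57_Material",
    "Timespan": "E52_Time-Span",
    "Time period": "E4_Period",
}


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def entity_to_ntriples(entity_uri, category, label, lang="it"):
    """Return two N-Triples lines: the entity's CRM type and its label."""
    crm_class = NE_TO_CRM[category]
    return [
        f"<{entity_uri}> <{RDF_TYPE}> <{CRM}{crm_class}> .",
        f'<{entity_uri}> <{RDFS_LABEL}> "{escape_literal(label)}"@{lang} .',
    ]


if __name__ == "__main__":
    # http://example.org/ is a placeholder namespace.
    for line in entity_to_ntriples("http://example.org/entity/1", "Place", "Pompei"):
        print(line)
```

Plain string formatting keeps the sketch free of dependencies; a real pipeline would more likely rely on an RDF library and a curated mapping.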

GATE toolchain (https://gate.ac.uk/)

The GATE pipeline has been refined and further developed:
- importing vocabularies and some pre-processing of their content
- replacing the Italian OpenNLP with FP7 project OpeNER components via web service calls from GATE, with resulting improvement in NER discovery
  - OpeNER: neural network instead of the OpenNLP maximum entropy model
- checking OpeNER outcomes
- refining stemming/lemmatization components
- developing part-of-speech (POS) rules for filtering on nouns when annotating
- specialised timespan and period component with pattern-based rules
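
The two rule-based steps mentioned above (POS filtering on nouns and pattern-based timespan rules) can be sketched in plain Python. This is not GATE or JAPE code and not the project's actual rules: the regular expressions, the `NOUN` tag, and the function names are assumptions chosen to show the idea on Italian text.

```python
import re

# Hypothetical patterns for Italian timespan expressions.
ROMAN = r"[IVXLC]+"
CENTURY = re.compile(
    rf"\b(?P<century>{ROMAN})\s+sec(?:olo|\.)(?:\s+(?P<era>a\.C\.|d\.C\.))?"
)
YEAR_RANGE = re.compile(r"\b(?P<start>\d{3,4})\s*-\s*(?P<end>\d{3,4})\b")


def find_timespans(text):
    """Return (start, end, matched_text) tuples ordered by position."""
    spans = []
    for pattern in (CENTURY, YEAR_RANGE):
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), match.group(0)))
    return sorted(spans)


def filter_noun_matches(tagged_tokens, vocabulary):
    """Keep vocabulary hits only when the token is tagged as a noun.

    tagged_tokens: list of (token, pos_tag) pairs; "NOUN" is an assumed tag name.
    vocabulary: set of lower-cased terms, e.g. imported material names.
    """
    return [
        token
        for token, pos in tagged_tokens
        if pos == "NOUN" and token.lower() in vocabulary
    ]


if __name__ == "__main__":
    sentence = "Il vaso risale al III secolo a.C. ed è stato restaurato nel 1950-1960."
    print([text for _, _, text in find_timespans(sentence)])
    tagged = [("bronzo", "NOUN"), ("rosso", "ADJ"), ("rosso", "NOUN")]
    print(filter_noun_matches(tagged, {"bronzo", "rosso"}))
```

Filtering on the POS tag prevents a vocabulary term from being annotated when it is used as an adjective, for example a colour word, which is presumably the kind of false positive such rules target.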

- Operated and maintained by CNR-ISTI on the D4Science VRE (https://www.d4science.org/)
- Workflow engine with GATE pipeline, operated as REST-style web services (running in Sheffield)
- Intuitive, web-based user interface
- User management
- Storage (private and shared files)

- No annotated text corpora available as training data for machine learning algorithms
  - Manual annotation of 400 pages of Italian archaeology reports in progress (current status: 200 pages annotated)
- No user-friendly cloud-based environment available
  - Desktop GATE pipeline migrated into D4Science
- AAI issues
  - The pilot focuses on freely available texts
  - User management within D4Science

- Finishing the manual annotation of training text corpora
- Training of machine-learning based NE applications
- Integration of the improved OpeNER recogniser into the D4Science GATE pipeline

- Creation of a text corpus for annotation
  - When is it big enough? When is the output good enough, and which text types are most relevant?
  - Focus on reasonable quality for the most common text types in contemporary Italian
- Interoperability of tool pipeline
  - GATE offers all necessary tools, except for Italian NER
  - Interoperability provided by GATE developers
  - Interoperability between TextCrowd's toolchain and other SSH workflow systems like WebLicht (https://weblicht.sfs.uni-tuebingen.de/) is not a focus now
- Performance enhancements
- User-friendly cloud-based named entity recognition (NER) workflow for Italian archaeologists
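
One common way to approach the question of when NER output is good enough is to compare predicted entities against the manually annotated corpus using precision, recall, and F1. The slides do not state which metric or threshold the project uses, so the sketch below is only an assumption: it uses exact-match scoring over hypothetical `(start, end, category)` tuples.

```python
def evaluate_ner(gold, predicted):
    """Exact-match precision, recall and F1 over entity annotations.

    gold, predicted: iterables of (start_offset, end_offset, category) tuples.
    An entity counts as correct only if offsets and category both match.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up annotations for illustration only.
    gold = [(0, 6, "Place"), (10, 25, "Timespan"), (30, 36, "Material")]
    predicted = [
        (0, 6, "Place"),
        (10, 25, "Time period"),  # right span, wrong category
        (30, 36, "Material"),
        (40, 45, "Person"),       # spurious entity
    ]
    p, r, f = evaluate_ner(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Tracking these scores per text type as the annotated corpus grows would give one practical signal for both the corpus size and the text type questions raised above.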