DKRZ Data Center Overview: Services and Infrastructure Updates
DKRZ is updating its data infrastructure hosting environment to improve services such as data life cycle management, quality assurance, and CMIP6 support. The data center is migrating to a new integrated HPC / data system and is establishing a national MIP data analysis cache and a data cloud. Long-term archival and data citation cover replication, quality assurance, and DOI assignment. The WDCC / CERA / HPSS integration with ESGF is operational for CMIP5, and an improved system based on FUSE mounting, mapfile generation from CERA metadata, and offline ESGF publication is planned for CMIP6. The quality assurance software has been restructured and modularized; it is used heavily for CORDEX and will support CMIP6.
Presentation Transcript
DKRZ Data center requirements and services Stephan Kindermann, Michael Lautenschlager, Katharina Berger, Tobias Weigel, Hans Dieter Hollweg Deutsches Klimarechenzentrum (DKRZ) S. Kindermann (DKRZ)
Overview: Update: the new data infrastructure hosting environment at DKRZ; ESGF: DKRZ data life cycle services; LTA / WDCC ESGF integration; quality assurance; data near processing; towards PID based services; CMIP6 at DKRZ. S. Kindermann (DKRZ) 14.12.2024
DKRZ data center update: migration to new integrated HPC / data system; separate DTNs (starting 2016); establishment of a national MIP data analysis cache; data cloud to support data ingest process.
[Diagram: ESGF infrastructure and CERA LTA infrastructure in three phases (from mid 2015, pre-shutdown; until end 2015; from 2016). Labels: HH: 2x10 GB; DFN: 2x3..5 GB; 4 data nodes, all behind firewall; Openstack cloud; index node; GPFS; Mistral; 2 DTNs; no separate DTNs; 2 data nodes; Lustre; 1 CERA/ESGF data node; VMs (XEN); HPC + interactive nodes + visualization nodes; ro NFS; CERA / ESGF portal; LTA data node; CERA portal; LTA (Oracle); Oracle DB cluster; HPSS; national MIP data cache; management etc. tbd.]
DKRZ long term archival and data citation. Major use case: replication; support data evaluation; quality assurance; long term archival; DOI assignment; exposure as ESGF data node.
[Diagram labels: ESGF shutdown; CERA Portal / DDC..; ESGF; WPS; COG portal; data node; data near processing; replication; versioning; container server; CERA (Oracle); QA; DOI; national climate data node (MIP cache); ESGF; process; LTA (HPSS); container cache; ingest.]
WDCC / CERA / HPSS ESGF integration. Operational for CMIP5: CERA metadata (Oracle) and ESGF index; THREDDS server with ESGF security filter plus HPSS data container server as ESGF data node. Improved system for CMIP6: FUSE based mounting of the DKRZ HPSS/cache legacy system; extraction of CERA metadata for ESGF mapfile generation; standard ESGF publication in an offline mode. Future: COG portal visibility of (non CMIP) WDCC LTA project data.
[Diagram labels: LTA; ESGF data node; Postgres / THREDDS; ESGF index node; ESGF Solr index; COG portal; ESGF publisher; mapfile generation; FUSE; container server; CERA (Oracle); LTA (HPSS); container cache.]
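The mapfile step named on this slide can be illustrated with a minimal sketch. It assumes the pipe-separated ESGF mapfile layout (dataset id with version, file path, size, then key=value pairs). The `ArchivedFile` record and its field names are hypothetical and do not reflect the CERA schema or the DKRZ tooling.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ArchivedFile:
    """Hypothetical record, standing in for metadata extracted from an archive catalogue."""
    dataset_id: str
    version: str
    path: str          # path as seen through the (assumed) FUSE mount
    size_bytes: int
    checksum: str
    checksum_type: str = "SHA256"


def mapfile_line(f: ArchivedFile) -> str:
    """Format one file as a pipe-separated mapfile entry."""
    fields = [
        f"{f.dataset_id}#{f.version}",
        f.path,
        str(f.size_bytes),
        f"checksum={f.checksum}",
        f"checksum_type={f.checksum_type}",
    ]
    return " | ".join(fields)


def write_mapfile(records: Iterable[ArchivedFile], out_path: str) -> None:
    """Write one mapfile entry per record."""
    with open(out_path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(mapfile_line(record) + "\n")
```

Because the entries are built from catalogue metadata alone, a publisher could in principle consume such a file without the data being staged first, which is the idea behind the offline mode mentioned above.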
(CMIP data) Quality Assurance Software. Completely re-structured and modularized: flexible configuration; used heavily for CORDEX, will support CMIP6; separate cf-checker module. Source code: https://github.com/h-dh/QA-DKRZ. Pre-packaged versions: conda based, docker based. Documentation: http://qa-dkrz.readthedocs.org/en/latest/qa-user-manual.html
[Diagram labels: NetCDF file; main; File; NC-API; M-D Store; Annotations; User-modified Directives; CF Conventions Tables; CF Conv. Checks; consistency between sub-temporal files; QA; CF Conventions Check, versions 1.4 - 1.6; Project Rules; Data; DRS; CV; 8-9 chapters of rules; variable requirements (CMOR); time; table based config (area-type, cf-standard-name, stand-region-name, ..); Project Configuration & Tables.]
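One of the checks named on this slide, consistency between sub-temporal files, can be illustrated with a small sketch. It is not the QA-DKRZ implementation. It assumes monthly data and file names ending in `_YYYYMM-YYYYMM.nc`, and it only inspects the names, not the time axis inside the files.

```python
import re
from typing import List, Tuple

# Assumed naming pattern: ..._YYYYMM-YYYYMM.nc (monthly data only).
_RANGE = re.compile(r"_(\d{6})-(\d{6})\.nc$")


def month_index(yyyymm: str) -> int:
    """Convert YYYYMM to a running month count so ranges can be compared."""
    return int(yyyymm[:4]) * 12 + int(yyyymm[4:]) - 1


def time_range(filename: str) -> Tuple[int, int]:
    """Extract the (start, end) month indices encoded in a file name."""
    match = _RANGE.search(filename)
    if match is None:
        raise ValueError(f"no time range in file name: {filename}")
    return month_index(match.group(1)), month_index(match.group(2))


def find_gaps_or_overlaps(filenames: List[str]) -> List[str]:
    """Report gaps and overlaps between consecutive files of one variable."""
    ranges = sorted((time_range(name), name) for name in filenames)
    problems = []
    for ((_, prev_end), prev_name), ((start, _), name) in zip(ranges, ranges[1:]):
        if start > prev_end + 1:
            problems.append(f"gap between {prev_name} and {name}")
        elif start <= prev_end:
            problems.append(f"overlap between {prev_name} and {name}")
    return problems
```

An empty result means the file names form one contiguous series; a fuller check would also compare the time values stored in each file.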
National MIP data analysis cache / node. Ad hoc approach: data needed; help desk; data manager; RO mounted on HPC data analysis nodes. Support for data analysis VM deployment. Support for tool dependency management (install recipes, conda, docker). WPS framework to support web service deployments: Birdhouse (https://github.com/bird-house); conda/docker support; support for home institution (test-) deployments; transparent solution.
[Diagram labels: WPS; data near processing; replication; versioning; national climate data node (MIP cache); ESGF; ingest.]
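To make the WPS idea concrete, here is a minimal sketch of how a client could address a WPS 1.0.0 service with key-value-pair requests. The endpoint URL and the process identifier are hypothetical placeholders, not actual DKRZ or Birdhouse deployments, and the sketch only builds URLs without contacting a server.

```python
from urllib.parse import urlencode


def get_capabilities_url(endpoint: str) -> str:
    """URL that asks a WPS service which processes it offers."""
    params = {"service": "WPS", "request": "GetCapabilities"}
    return f"{endpoint}?{urlencode(params)}"


def describe_process_url(endpoint: str, identifier: str) -> str:
    """URL that asks for the inputs and outputs of one process."""
    params = {
        "service": "WPS",
        "version": "1.0.0",
        "request": "DescribeProcess",
        "identifier": identifier,
    }
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local test deployment and process name.
    endpoint = "http://localhost:8094/wps"
    print(get_capabilities_url(endpoint))
    print(describe_process_url(endpoint, "subset"))
```

Because the interface is plain HTTP, the same requests work against a central deployment and against a test deployment at a home institution, which fits the deployment support listed above.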
Stable file/collection management !?
[Diagram labels: ESGF; WPS; COG portal; CERA Portal / ..; data node; data near processing; replication; versioning; container server; CERA (Oracle); QA; DOI; national climate data node (MIP cache); ESGF; process; LTA (HPSS); container cache; ingest.]
Towards PID based services. Motivation: stable ESGF data space based on PID infrastructure. Collaborations: ePIC: DKRZ partner, prefix registration; EUDAT: DKRZ leads PID task, API; RDA: DKRZ co-chairs PIT and collections WGs; Envri+: PIDs in environmental sciences. Next ESGF steps: test environment (PID system + publisher); scalable, stable PID assignment: CMOR integration, CDNOT involvement; PID API / ESGF publisher integration; highly available message queuing system integration.
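A minimal sketch of PID assignment for files and a collection follows. It assumes Handle-style identifiers of the form prefix/suffix with a UUID suffix. The prefix `99999`, the `PidRecord` fields, and the `register_dataset` helper are hypothetical, and nothing is registered with a real PID service or message queue.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def mint_pid(prefix: str, suffix: Optional[str] = None) -> str:
    """Build a prefix/suffix identifier; a random UUID is used if no suffix is given."""
    return f"{prefix}/{suffix or uuid.uuid4()}"


@dataclass
class PidRecord:
    """Hypothetical PID record: what the identifier resolves to."""
    pid: str
    location: str = ""
    checksum: str = ""
    members: List[str] = field(default_factory=list)  # PIDs of contained objects


def register_dataset(prefix: str, files: Dict[str, str]) -> List[PidRecord]:
    """Mint one PID per file (URL -> checksum) plus one PID for the collection.

    Returns the file records followed by the collection record.
    """
    file_records = [
        PidRecord(mint_pid(prefix), location=url, checksum=checksum)
        for url, checksum in files.items()
    ]
    collection = PidRecord(mint_pid(prefix), members=[r.pid for r in file_records])
    return file_records + [collection]
```

Minting the suffix locally is one way identifiers could be assigned early, for example when a file is written, with registration following later at publication time.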
Summary: long term archival use case; ESGF integration; quality assurance; PID assignment early in the data life cycle: early citation and DOI assignment, future PID based data management services, future PID based end user services, future PID based provenance support.
Thank you
DKRZ services, new developments: new integrated HPC/data system installed in 2015, ~ 50 PByte Lustre; storage cloud (OpenStack); community data analysis cache and platform; ESGF: WDCC/HPSS/ESGF data node; WPS compute platform Birdhouse; data ingest; towards PID / early citation services.
(Early) Data Citation (DM + ESGF). Impact on CMIP6 data management (DM) and ESGF governance (ESGF). Request from modelling groups for a data citation reference just after ESGF data publication. CMIP6 data publication workflow: [figure not recovered]. CMIP6 citation granularities are collection levels: simulation, model. S. Kindermann (DKRZ), iCAS2015
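The two citation granularities can be illustrated by grouping dataset identifiers into collections. This is a sketch under assumptions: the identifiers are dot-separated, the model is the fourth facet, and the experiment (simulation) is the fifth. The facet positions and the example identifiers are hypothetical and are not taken from the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed facet order: project.activity.institution.model.experiment...
MODEL_DEPTH = 4
SIMULATION_DEPTH = 5


def collection_key(dataset_id: str, depth: int) -> str:
    """Truncate a dot-separated dataset id to its first `depth` facets."""
    parts = dataset_id.split(".")
    if len(parts) < depth:
        raise ValueError(f"dataset id has fewer than {depth} facets: {dataset_id}")
    return ".".join(parts[:depth])


def group_by_collection(dataset_ids: Iterable[str], depth: int) -> Dict[str, List[str]]:
    """Group dataset ids under the collection that a citation would refer to."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for dataset_id in dataset_ids:
        groups[collection_key(dataset_id, depth)].append(dataset_id)
    return dict(groups)
```

With this grouping, a model-level citation covers every simulation of that model, while a simulation-level citation covers only the datasets of one experiment.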