DataVault @Discovery
Managing Data Storage for Research Projects
Derek Cooper Ph.D., Morgridge Institute for Research
Em Craft, Wisconsin Institute for Discovery, UW-Madison
Development of the DataVault
The Discovery Building opened in 2010
Morgridge and WID Researchers needed disk space right away
Considered Gluster and Lustre for a shared Distributed File System
Gluster had recently been purchased by Red Hat
Lustre @ IceCube had recently crashed and was rebuilt from scratch
We chose Gluster for its ease of management, hopeful stability, and the bet that Red Hat would do well by the open-source project
Discovery Data Management: 2010-2016
We surveyed the labs for metadata to help manage the data
That failed miserably, as none of the PIs would fill out the paperwork
Data was grouped by Department
Department-level quotas within Organization-level quotas
Data became scattered with no record of who owned it
One person (or CHTC job) in one lab could fill the volume for the entire department
Cleanup and archiving were impossible, as no one knew what or when to archive
Project files ended up in multiple locations
There was no tracking of the software or raw files used, so reproducibility suffered
Research data was sometimes stored in Home Directories, which were orphaned when students graduated
No one person was authorized or responsible for clean-ups
Galaxy – A potential solution
Galaxy is open-source software that could have helped with managing projects and tracking metadata for reproducibility
Great concept – tracked files, metadata, and the software versions used
Unfortunately, the security and manageability were not enterprise-ready
CHTC jobs would want to run as root
We couldn't lock down users' data from other users' jobs
WID DataVault: 2017
The DiscoverIT team and Em Craft put their heads together to solve the
numerous issues in managing the data lifecycle @ Discovery
WID broke off from the old cluster and bought new Dell 720XD storage servers with expansion
chassis
An additional server was also added for home directories (not available on the 1st system)
A Redmine ticketing system was developed to hold all the metadata needed to track the lifecycle of research data
Including the authorized users in the project metadata allowed scripts to automatically provision the folders and ACLs for each project on the production servers, along with matching space on the backup servers
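The provisioning scripts described above might look roughly like the sketch below. This is a minimal, hypothetical version: the base path, project name, and permission scheme are assumptions for illustration, and the production scripts also created Active Directory groups and backup-server space.

```shell
#!/bin/sh
# Sketch of automated project provisioning (hypothetical paths and names).
set -eu

BASE="${BASE:-$(mktemp -d)}"    # stand-in for the production volume root
PROJECT="Smith-RNASeq"          # follows the [PI Name]-[Project Name] convention

# Create the project folder with a setgid group directory so new files
# inherit the project group.
mkdir -p "$BASE/$PROJECT"
chmod 2770 "$BASE/$PROJECT"
# In production: chgrp to the project group and set ACLs for each
# authorized user listed in the Redmine ticket.
echo "provisioned $BASE/$PROJECT"
```

The setgid bit (the leading 2 in 2770) is what keeps group ownership consistent as many users write into the same project tree.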
Morgridge DataVault: 2023
Andrew Maier took over the role of data evangelist for the Morgridge DataVault.
Morgridge was implementing a new shared distributed file system, so this was the perfect opportunity to promote the DataVault concept, as it was tried and tested at WID
The Morgridge DataVault was built on Supermicro hardware using a mix of spinning disk and NVMe storage
The Distributed File System chosen this time was Ceph using erasure coding for redundancy
The DataVault was installed on a separate network that is a part of the CHTC block of IP Addresses
Traditionally, Morgridge storage systems are installed on the Morgridge network behind the firewall
Morgridge DataVault: 2024
The DataVault was made accessible to both the UW NetID system of user accounts and the Discovery Active Directory accounts
Making sure UIDs didn't overlap was the biggest technical challenge
This allowed Morgridge shared storage to reach CHTC HTCondor without a firewall in between, taking advantage of the full network bandwidth
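The UID-overlap check can be as simple as comparing the numeric UIDs from both account sources before merging them. The entries below are made-up examples; in practice the input would be `getent passwd` output from each directory.

```shell
# Detect UID collisions between two account sources (sample data only).
netid_users='alice:x:5001:5001::/home/alice:/bin/bash
bob:x:5002:5002::/home/bob:/bin/bash'
ad_users='carol:x:6001:6001::/home/carol:/bin/bash
dave:x:5002:5002::/home/dave:/bin/bash'   # dave collides with bob

# Print any UID that appears more than once across both sources.
printf '%s\n%s\n' "$netid_users" "$ad_users" \
  | awk -F: '{print $3}' | sort | uniq -d
```

Any UID this prints (here, 5002) must be remapped in one directory before both can mount the same POSIX filesystem safely.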
DataVault Projects
Definitions
PI: Principal Investigator
Project Data: data files, software, documents
Project Metadata: who, what, when, where, why
Project Completion: UW finished Closeout, PI/lab no longer need the data 🤞
Lifecycle
1) Initialize
2) Use and Update
3) Archive and Remove
DataVault Projects: Initialize
The PI (or Lab Manager) defines the metadata (Redmine)
[PI Name]-[Project Name]
Size in GB
Access List
Extras
DiscoverIT Server Team initializes the Project
Create groups and folders in DataVault (Ceph, Python, Active Directory)
Set initial quota (setfattr for Ceph, xfs_quota for XFS)
Configure and maintain backups (rsnapshot)
Update metadata and notify the PI
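The quota step uses the standard mechanisms for each filesystem: CephFS stores quotas as extended attributes on the directory, while XFS uses project quotas via xfs_quota. The sketch below echoes the commands rather than executing them, since both need a mounted volume; the path and size are hypothetical.

```shell
# Dry-run sketch of the quota step (hypothetical project path and size).
PROJ_DIR="/vault/Smith-RNASeq"
SIZE_GB=500
BYTES=$(( SIZE_GB * 1024 * 1024 * 1024 ))

# CephFS: quota is an extended attribute on the project directory
echo setfattr -n ceph.quota.max_bytes -v "$BYTES" "$PROJ_DIR"
# XFS: project quota set via xfs_quota in expert mode
echo xfs_quota -x -c "limit -p bhard=${SIZE_GB}g Smith-RNASeq" /vault
```

Because the Ceph quota lives on the directory itself, raising a project's allocation later is a one-line setfattr change, which suits the request-driven workflow above.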
DataVault Projects: Use and Update
Project Members create and access data
Supported compute servers: domain-level connection (NFS, kernel driver)
Instruments, equipment, and other computers: user-level connection (SMB)
Globus endpoints available
DiscoverIT Server Team and PIs maintain the project
Access managed through on/off-boarding or one-off requests (Redmine, ManageEngine, Freshworks)
Access and usage audited each semester (Access, PowerShell, and a LOT of email)
Review usage graphs and reports (Grafana and Diskover Data)
Size and inode notifications (Icinga)
Fix Unix file permissions as needed (chmod)
DataVault Projects: Archive and Remove
After Project Completion, DiscoverIT moves the data to DoIT's Storage and updates the metadata (Globus, S3)
Once archived, DiscoverIT updates the metadata again and can delete the local data
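The archive step could be driven by either tool named above. The endpoint IDs, paths, and bucket below are placeholders, and the commands are echoed rather than executed since both tools require credentials.

```shell
# Dry-run sketch of the archive step (placeholder endpoints and bucket).
PROJECT="Smith-RNASeq"

# Option 1: Globus transfer from the DataVault endpoint to DoIT storage
echo globus transfer "VAULT_ENDPOINT:/projects/$PROJECT/" \
                     "DOIT_ENDPOINT:/archive/$PROJECT/" --recursive
# Option 2: push to S3-compatible object storage
echo aws s3 sync "/vault/$PROJECT/" "s3://doit-archive/$PROJECT/"
```

Either way, the Redmine metadata is what records where the archived copy lives before the local data is removed.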
Flexible Implementation
DataVault Project Types
Standard: PI-ProjectName, PI-RawData, WID-Software, Shared
Other Data Storage: Home Directory, Local Storage, Removable Storage, CHTC Storage, Research Drive, Campus-Licensed Cloud Storage
[Diagram: a PI with Students A and B, each accessing their own projects plus a shared project]