DataVault @Discovery
Managing Data Storage for Research Projects
Derek Cooper Ph.D., Morgridge Institute for Research
Em Craft, Wisconsin Institute for Discovery, UW-Madison
Development of the DataVault
The Discovery Building opened in 2010
Morgridge and WID Researchers needed disk space right away
Considered Gluster and Lustre for a shared Distributed File System
Gluster had recently been purchased by Red Hat
Lustre @ IceCube had recently crashed and was rebuilt from scratch
We chose Gluster for its ease of management, hopeful stability, and the bet that Red Hat would do well by the open-source project
Discovery Data Management: 2010-2016
We surveyed the labs for metadata to help manage the data
That failed miserably, as none of the PIs would fill out the paperwork
Data was grouped by Department
Department-level quotas within Organization-level quotas
Data became scattered with no record of who owned it
One person (or CHTC job) in one lab could fill the volume for the entire department
Cleanup and archiving were impossible, as no one knew what or when to archive
Project files ended up in multiple locations
There was no tracking of the software or raw files used, so reproducibility suffered
Research data was sometimes stored in Home Directories, which were orphaned when students graduated
No one person was authorized or responsible for clean-ups
Galaxy – A potential solution
Galaxy is open-source software that could have helped with managing projects and tracking metadata for reproducibility
Great concept – tracked files, metadata, and the software versions used
Unfortunately, the security and manageability were not enterprise-ready
CHTC jobs would want to run as root
We couldn't lock down users' data from other users' jobs
WID DataVault: 2017
The DiscoverIT team and Em Craft put their heads together to solve the
numerous issues in managing the data lifecycle @ Discovery
WID broke off from the old cluster and bought new Dell 720XD storage servers with expansion
chassis
An additional server was also added for home directories (not available on the 1st system)
A Redmine ticketing system was developed to hold all the metadata needed to track the lifecycle of research data
Including the authorized users in the project metadata allowed scripts to automatically provision the folders and ACLs for each project on the production servers, along with matching space on the backup servers
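The provisioning scripts described above might look roughly like the sketch below. This is a minimal, hypothetical version: the base path, project name, and permission scheme are assumptions for illustration, and the production scripts also created Active Directory groups and backup-server space.

```shell
#!/bin/sh
# Sketch of automated project provisioning (hypothetical paths and names).
set -eu

BASE="${BASE:-$(mktemp -d)}"    # stand-in for the production volume root
PROJECT="Smith-RNASeq"          # follows the [PI Name]-[Project Name] convention

# Create the project folder with a setgid group directory so new files
# inherit the project group.
mkdir -p "$BASE/$PROJECT"
chmod 2770 "$BASE/$PROJECT"
# In production: chgrp to the project group and set ACLs for each
# authorized user listed in the Redmine ticket.
echo "provisioned $BASE/$PROJECT"
```

The setgid bit (the leading 2 in 2770) is what keeps group ownership consistent as many users write into the same project tree.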
Morgridge DataVault: 2023
Andrew Maier took over the role of data evangelist for the Morgridge DataVault.
Morgridge was implementing a new shared distributed file system, so this was the perfect opportunity to promote the DataVault concept, as it was tried and tested at WID
The Morgridge DataVault was built on Supermicro hardware using a mix of spinning disk and NVMe storage
The Distributed File System chosen this time was Ceph using erasure coding for redundancy
The DataVault was installed on a separate network that is a part of the CHTC block of IP Addresses
Traditionally, Morgridge storage systems are installed on the Morgridge network behind the firewall
Morgridge DataVault: 2024
The DataVault was made accessible to both the UW NetID system of user accounts and the Discovery Active Directory accounts
Making sure UIDs didn't overlap was the biggest technical challenge
This allowed Morgridge shared storage to reach CHTC HTCondor without a firewall in between, taking advantage of the full network bandwidth
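The UID-overlap check can be as simple as comparing the numeric UIDs from both account sources before merging them. The entries below are made-up examples; in practice the input would be `getent passwd` output from each directory.

```shell
# Detect UID collisions between two account sources (sample data only).
netid_users='alice:x:5001:5001::/home/alice:/bin/bash
bob:x:5002:5002::/home/bob:/bin/bash'
ad_users='carol:x:6001:6001::/home/carol:/bin/bash
dave:x:5002:5002::/home/dave:/bin/bash'   # dave collides with bob

# Print any UID that appears more than once across both sources.
printf '%s\n%s\n' "$netid_users" "$ad_users" \
  | awk -F: '{print $3}' | sort | uniq -d
```

Any UID this prints (here, 5002) must be remapped in one directory before both can mount the same POSIX filesystem safely.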
DataVault Projects
Definitions
PI: Principal Investigator
Project Data: data files, software, documents
Project Metadata: who, what, when, where, why
Project Completion: UW finished Closeout, PI/lab no longer need the data 🤞
Lifecycle
1) Initialize
2) Use and Update
3) Archive and Remove
DataVault Projects: Initialize
The PI (or Lab Manager) defines the metadata (Redmine)
[PI Name]-[Project Name]
Size in GB
Access List
Extras
DiscoverIT Server Team initializes the Project
Create groups and folders in DataVault (Ceph, Python, Active Directory)
Set initial quota (setfattr for Ceph, xfs_quota for XFS)
Configure and maintain backups (rsnapshot)
Update metadata and notify the PI
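The quota step uses the standard mechanisms for each filesystem: CephFS stores quotas as extended attributes on the directory, while XFS uses project quotas via xfs_quota. The sketch below echoes the commands rather than executing them, since both need a mounted volume; the path and size are hypothetical.

```shell
# Dry-run sketch of the quota step (hypothetical project path and size).
PROJ_DIR="/vault/Smith-RNASeq"
SIZE_GB=500
BYTES=$(( SIZE_GB * 1024 * 1024 * 1024 ))

# CephFS: quota is an extended attribute on the project directory
echo setfattr -n ceph.quota.max_bytes -v "$BYTES" "$PROJ_DIR"
# XFS: project quota set via xfs_quota in expert mode
echo xfs_quota -x -c "limit -p bhard=${SIZE_GB}g Smith-RNASeq" /vault
```

Because the Ceph quota lives on the directory itself, raising a project's allocation later is a one-line setfattr change, which suits the request-driven workflow above.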
DataVault Projects: Use and Update
Project Members create and access data
Supported compute servers: domain-level connection (NFS, kernel driver)
Instruments, equipment, and other computers: user-level connection (SMB)
Globus endpoints available
DiscoverIT Server Team and PIs maintain the project
Access managed through on/off-boarding or one-off requests (Redmine, ManageEngine, Freshworks)
Access and usage audited each semester (Access, PowerShell, and a LOT of email)
Review usage graphs and reports (Grafana and Diskover Data)
Size and inode notifications (Icinga)
Fix Unix file permissions as needed (chmod)
DataVault Projects: Archive and Remove
After Project Completion, DiscoverIT moves the data to DoIT's Storage and updates the metadata (Globus, S3)
Once archived, DiscoverIT updates the metadata again and can delete the local data
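The archive step could be driven by either tool named above. The endpoint IDs, paths, and bucket below are placeholders, and the commands are echoed rather than executed since both tools require credentials.

```shell
# Dry-run sketch of the archive step (placeholder endpoints and bucket).
PROJECT="Smith-RNASeq"

# Option 1: Globus transfer from the DataVault endpoint to DoIT storage
echo globus transfer "VAULT_ENDPOINT:/projects/$PROJECT/" \
                     "DOIT_ENDPOINT:/archive/$PROJECT/" --recursive
# Option 2: push to S3-compatible object storage
echo aws s3 sync "/vault/$PROJECT/" "s3://doit-archive/$PROJECT/"
```

Either way, the Redmine metadata is what records where the archived copy lives before the local data is removed.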
Flexible Implementation
DataVault Project Types
Standard: PI-ProjectName, PI-RawData, WID-Software, Shared
Other Data Storage: Home Directory, Local Storage, Removable Storage, CHTC Storage, Research Drive, Campus-Licensed Cloud Storage
[Diagram: a PI with Students A and B, each accessing their own projects plus a shared project]