Cloud-Optimized HDF5 Files Overview

Explore the concept of cloud-optimized HDF5 files, including cloud-optimized and cloud-native storage formats and the benefits of using HDF5 in cloud environments. Learn about key strategies such as Paged Aggregation, chunk size optimization, and variable-length datatype considerations that enhance data access efficiency. Discover why cloud-optimized HDF5 eases the transition of data from archive tapes to cloud object stores.


Presentation Transcript


  1. Toward Cloud-Optimized HDF5 Files. 2023 ESIP Winter Meeting. Aleksandar Jelenak, NASA EED-3/HDF Group, ajelenak@hdfgroup.org. This work was supported by NASA/GSFC under Raytheon Technologies contract number 80GSFC21CA001. This document does not contain technology or Technical Data controlled under either the U.S. International Traffic in Arms Regulations or the U.S. Export Administration Regulations.

  2. Cloud-Optimized Storage Format: an existing file format whose internal structures are rearranged for more efficient data access when stored in cloud object stores.

  3. Cloud-Native Storage Format: a data format developed specifically to take advantage of the distributed nature of cloud computing and the key-value interface of object storage systems. It counts as cloud optimized as well.

  4. Cloud Native: One file consists of many objects in the cloud store. Data reads access entire objects. Data modifications and additions are allowed.

  5. Cloud Optimized: One file is one object. Data access is read-only. Any update overwrites (replaces) that file (cloud store object). Data reads access parts of that object via HTTP* range GET requests. *Hypertext Transfer Protocol
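The range GET mechanism is plain HTTP. Here is a minimal sketch, assuming a hypothetical public object URL and the third-party requests package, of reading only the first 8 MiB of an object:

```python
# Minimal sketch: read only the first 8 MiB of a cloud object with an
# HTTP range GET. The URL is hypothetical; any object store URL works.
import requests

url = "https://example-bucket.s3.amazonaws.com/granule.h5"

# "Range: bytes=0-8388607" asks the server for the first 8 MiB only.
resp = requests.get(url, headers={"Range": "bytes=0-8388607"})
resp.raise_for_status()  # a successful partial read returns HTTP 206

print(resp.status_code, len(resp.content))  # 206, 8388608
```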

  6. Why Cloud-Optimized HDF5*? The least amount of reformatting when moving data from archive tapes to cloud object stores. Fast content scanning when files are in an object store. The HDF5 library can be used instead of a custom, limited-feature HDF5 file format reader. *Hierarchical Data Format Version 5

  7. Properties of Cloud-Optimized HDF5: Created with Paged Aggregation as the file space management strategy. Large dataset chunk size (2-16 MiB*). A chunk compression method that performs well for the chosen chunk size. Minimal use of variable-length datatypes. (Important when not using the HDF5 library.) *mebibyte (1,048,576 bytes)

  8. Variable-Length Datatypes: The current implementation of variable-length data in HDF5 files prevents easy retrieval using HTTP range GET requests. Avoid these datatypes if not using the HDF5 library. Alternative access methods require duplicating the variable-length data outside its file.
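As a quick way to check whether a file is affected, a minimal h5py sketch (the file name is hypothetical) that walks a file and flags datasets with variable-length datatypes:

```python
# Minimal sketch: flag every dataset that uses a variable-length datatype,
# since such data is hard to read with plain HTTP range GET requests.
import h5py

def report_vlen(name, obj):
    # check_vlen_dtype() returns the base type for vlen data, else None
    if isinstance(obj, h5py.Dataset) and h5py.check_vlen_dtype(obj.dtype):
        print("variable-length dataset:", name)

with h5py.File("granule.h5", "r") as f:  # hypothetical file name
    f.visititems(report_vlen)
```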

  9. Large Dataset Chunk Sizes: Chunk size is the product of the dataset's datatype size in bytes and the chunk's number of elements (the product of the chunk's dimensions). AWS* best practice suggests one HTTP range GET request should cover 8-16 MiB. The total number of chunk elements can therefore be calculated from the optimal request size. *Amazon Web Services
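For illustration, a minimal sketch of that arithmetic, assuming a 16 MiB target request and a 4-byte (float32) datatype:

```python
# Minimal sketch: derive the number of chunk elements from the target
# request size (16 MiB) and the dataset's datatype size (float32, 4 bytes).
import numpy as np

target_bytes = 16 * 1024 * 1024            # top of the 8-16 MiB range
dtype_size = np.dtype("float32").itemsize  # 4 bytes per element

n_elements = target_bytes // dtype_size
print(n_elements)  # 4194304 elements, e.g. a 2048 x 2048 chunk shape
```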

  10. Large Dataset Chunk Sizes (cont.): The actual chunk shape still needs to account for expected data access patterns. The default HDF5 library dataset chunk cache is only 1 MiB but is configurable per dataset. An appropriate chunk cache size has a significant impact on I/O performance. There is a trade-off between chunk size and compression/decompression speed.
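In h5py the chunk cache is configured when the file is opened and applies to datasets accessed through that file handle. A minimal sketch, with hypothetical file and dataset names:

```python
# Minimal sketch: enlarge the chunk cache so whole 8-16 MiB chunks fit.
import h5py

f = h5py.File(
    "granule.h5", "r",             # hypothetical file name
    rdcc_nbytes=64 * 1024 * 1024,  # 64 MiB cache (library default: 1 MiB)
    rdcc_nslots=401,               # a prime ~100x the chunks that fit
)
data = f["/some/dataset"][:1024, :1024]  # hypothetical dataset path
f.close()
```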

  11. HDF5 Paged Aggregation: One of the available file space management strategies, but not the default. It can only be set at file creation. Best suited when file content is added once and never modified.

  12. HDF5 Paged Aggregation (cont'd): The library always reads and writes whole file pages. Internal file metadata and raw data are organized into separate pages of a specified size. Setting an appropriate page size can place all internal file metadata in just one page. (Great for optimized cloud access.)

  13. How to Apply Paged Aggregation. Existing files: $ h5repack -S PAGE -G PAGE_SIZE_BYTES in.h5 out.h5 New files: H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_PAGE, ...) and H5Pset_file_space_page_size(fcpl, page_size), where fcpl is a file creation property list variable.
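The same can be done from h5py, which exposes these file creation settings as keywords (h5py 3.0 or later). A minimal sketch for a new file:

```python
# Minimal sketch: create a new file with paged aggregation and 8 MiB pages.
import h5py

with h5py.File(
    "out.h5", "w",
    fs_strategy="page",            # H5F_FSPACE_STRATEGY_PAGE
    fs_page_size=8 * 1024 * 1024,  # file page size in bytes
) as f:
    # 2048 x 2048 float32 chunks = 16 MiB, matching the earlier advice
    f.create_dataset("x", shape=(8192, 8192), dtype="f4",
                     chunks=(2048, 2048), compression="gzip")
```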

  14. HDF5 Page Buffering: A low-level library cache for file metadata and raw data pages. Only available for files created with paged aggregation. The page buffer size must be an exact multiple of the file's page size.
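From h5py (3.7 or later) the page buffer is enabled when the file is opened. A minimal sketch, reusing the hypothetical file created above:

```python
# Minimal sketch: open a paged-aggregation file with a 16 MiB page buffer.
import h5py

with h5py.File(
    "out.h5", "r",
    page_buf_size=16 * 1024 * 1024,  # a multiple of the file's page size
) as f:
    data = f["x"][:2048, :2048]
```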

  15. Case Study: ICESat-2* ATL03 product. File size: 2,458,294,886 bytes (2.29 GiB**). 171 HDF5 groups. 1,001 HDF5 datasets. HDF5 file metadata size: 7,713,028 bytes. Repack the original file using Paged Aggregation with two file page sizes: 4 MiB (PAGE4MiB) and 8 MiB (PAGE8MiB). The internal file metadata then occupies two pages in the PAGE4MiB file and one page in the PAGE8MiB file. *Ice, Cloud, and Land Elevation Satellite-2 **gibibyte (1,073,741,824 bytes)

  16. How to Find the Internal File Metadata Size?
  $ h5stat -S ATL03_20190928165055_00270510_004_01.h5
  Filename: ATL03_20190928165055_00270510_004_01.h5
  File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
  File space page size: 4096 bytes
  Summary of file space information:
    File metadata: 7713028 bytes
    Raw data: 2458294886 bytes
    Amount/Percent of tracked free space: 0 bytes/0.0%
    Unaccounted space: 91670 bytes
    Total space: 2466099584 bytes

  17. Case Study (cont'd): With the three files in Amazon S3 (Simple Storage Service), list the file content and the file locations of all dataset chunks. This is a common task when enabling alternative access to HDF5 data, e.g. HDF Scalable Data Service (HSDS), OPeNDAP (Open-source Project for a Network Data Access Protocol), DMR++ (Dataset Metadata Response++), and kerchunk (xarray/Zarr ecosystem). Tested using h5py 3.7.0 and HDF5 1.13.1 with the Read-Only S3 Virtual File Driver (VFD).
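A minimal h5py sketch of that task, using the low-level chunk query API (available in h5py 3.x); the file name is taken from the case study:

```python
# Minimal sketch: list every chunked dataset and where its chunks live in
# the file (byte offset and size), the information that tools like DMR++
# and kerchunk record.
import h5py

def print_chunk_locations(name, obj):
    if isinstance(obj, h5py.Dataset) and obj.chunks:
        dsid = obj.id
        for i in range(dsid.get_num_chunks()):
            info = dsid.get_chunk_info(i)  # StoreInfo for the i-th chunk
            print(f"{name} chunk at {info.chunk_offset}: "
                  f"offset={info.byte_offset} size={info.size}")

with h5py.File("ATL03_20190928165055_00270510_004_01.h5", "r") as f:
    f.visititems(print_chunk_locations)
```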

  18. Case Study (cont'd): Computing environment: AWS EC2 m5.xlarge, with the EC2 and S3 resources in the same AWS region; Python 3.11, h5py 3.7.0, libhdf5 1.13.1. Access methods: libhdf5 for local access; libhdf5 with the Read-Only S3 virtual file driver (ros3) for S3 access; libhdf5 with the s3fs Python package for S3 access.
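A minimal sketch of the two S3 access paths (the bucket name and region are hypothetical, and the ros3 path requires an HDF5 build with that driver enabled):

```python
# Minimal sketch: open the same file via the ros3 driver and via s3fs.
import h5py
import s3fs

# 1) libhdf5 ros3 virtual file driver: the library issues range GETs itself.
f_ros3 = h5py.File(
    "https://example-bucket.s3.us-west-2.amazonaws.com/"
    "ATL03_20190928165055_00270510_004_01.h5",
    "r", driver="ros3", aws_region=b"us-west-2",
)

# 2) s3fs: a Python file-like object; each read becomes a range GET of up
#    to block_size bytes.
fs = s3fs.S3FileSystem(anon=True)
f_s3fs = h5py.File(
    fs.open("example-bucket/ATL03_20190928165055_00270510_004_01.h5",
            "rb", block_size=5 * 1024 * 1024),
    "r",
)
```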

  19. Repack Runtime and File Size Change

  File Space Page Size (MiB)   Repack Time (s)   File Size Change (%)
  4                            19.4              +0.18
  8                            19.6              +0.35
  16                           19.5              +0.69
  32                           19.6              +2.05
  64                           19.7              +3.41

  20. Perf. Results: Local File System

  File Version   Page Buffer Size   Runtime (s)   Perf. Ratio vs Baseline
  Orig.          n/a                8.8912        1
  PAGE4MiB       4 MiB              9.3508        1.0517
  PAGE4MiB       8 MiB              8.9792        1.0099
  PAGE8MiB       8 MiB              8.9273        1.0041

  21. Perf. Results: S3 with libhdf5 + ros3 VFD

  File Version   Page Buffer Size   Runtime (s)   Perf. Ratio vs Local Baseline
  Orig.          n/a                154.922       17.4242
  PAGE4MiB       4 MiB              51.9803       5.8463
  PAGE4MiB       8 MiB              9.1082        1.0244
  PAGE8MiB       8 MiB              9.141         1.0281

  22. Perf. Results: S3 with libhdf5 + s3fs Python package

  File Version   Page Buffer Size   s3fs Block Size   Runtime (s)   Perf. Ratio vs Local Baseline
  Orig.          n/a                5 MiB             192.9893      21.7057
  PAGE4MiB       4 MiB              5 MiB             106.8446      12.0169
  PAGE4MiB       8 MiB              5 MiB             8.7776        0.9872
  PAGE8MiB       8 MiB              5 MiB             8.8883        0.9996
  Orig.          n/a                50 MiB            763.6914      85.8931
  PAGE4MiB       4 MiB              50 MiB            8.8035        0.9901
  PAGE8MiB       8 MiB              50 MiB            8.8035        0.9901

  23. Wrap-Up: The HDF5 library supports aggregating internal file metadata and raw data bytes into separate file pages of configurable size. This is only available for new files; existing files must be repacked. Using the HDF5 page buffer cache when reading such files from a cloud object store can significantly improve performance.

  24. Wrap-Up (cont'd): The page buffer size must be at least the size of the total internal file metadata. As with any other cache: the more, the merrier! Page buffer statistics are available for fine-tuning.

  25. This work was supported by NASA/GSFC under Raytheon Technologies contract number 80GSFC21CA001. Thank you!
