Cloud-Optimized HDF5 Files for Efficient Data Access

 
Cloud-Optimized HDF5 Files
 
2023 ESIP Summer Meeting
 
This work was supported by NASA/GSFC under Raytheon Technologies contract number 80GSFC21CA001.
This document does not contain technology or Technical Data controlled under either the U.S. International Traffic in Arms Regulations or the U.S. Export Administration Regulations.
 
Aleksandar Jelenak
NASA EED-3/HDF Group
ajelenak@hdfgroup.org
 
 
Cloud Optimized HDF5* File

An HDF5 file where internal structures are arranged for more efficient data access when in cloud object stores.

*Hierarchical Data Format Version 5
 
Cloud Optimized Means…

One file = one cloud store object.
Read-only data access.
Data reading based on HTTP* range GET requests with specific file offsets and byte counts.

*Hypertext Transfer Protocol
 
Why Cloud-Optimized HDF5 Files?

Least amount of reformatting from archive tapes to cloud object stores.
The HDF5 library instead of custom HDF5 file format readers with limited capabilities.
Fast content scanning when files are in an object store.
Efficient data access for both cloud-native and conventional applications.
 
 
What Makes a Cloud-Optimized HDF5 File?

Large dataset chunk size (2–16 MiB*).
Minimal use of variable-length datatypes.
Combined internal file metadata.

*mebibyte (1,048,576 bytes)
 
Large Dataset Chunk Sizes

Chunk size is the product of the dataset's datatype size in bytes and the chunk's shape (number of elements for each dimension).
An AWS* best practices document recommends that one HTTP range GET request cover 8–16 MiB.

*Amazon Web Services
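
For example, a minimal sketch of the chunk size arithmetic (the float32 datatype and the 512 x 1024 chunk shape are illustrative assumptions, not values from this presentation):

import numpy as np

# Hypothetical dataset: float32 elements, chunks of 512 x 1024 elements.
dtype = np.dtype("float32")         # 4 bytes per element
chunk_shape = (512, 1024)           # elements per chunk dimension

chunk_nbytes = dtype.itemsize * int(np.prod(chunk_shape))
print(chunk_nbytes / 2**20, "MiB")  # 4 * 512 * 1024 bytes = 2.0 MiB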
 
Large Dataset Chunk Sizes (cont.)

Chunk shape still needs to account for likely data access patterns.
The default HDF5 library dataset chunk cache is only 1 MiB but is configurable per dataset.
Chunk cache size can have a significant impact on I/O performance.
There is a trade-off between chunk size and compression/decompression speed.
 
How to Set Chunk Cache Size?

HDF5 library
   H5Pset_cache() for all datasets in a file
   H5Pset_chunk_cache() for individual datasets
h5py
   h5py.File class for all datasets in a file
   h5py.Group.require_dataset() method for individual datasets
netCDF* library
   nc_set_chunk_cache() for all variables in a file
   nc_set_var_chunk_cache() for individual variables

*Network Common Data Form
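
A minimal h5py sketch of both levels (the file name, dataset name, shape, and cache sizes are illustrative assumptions):

import h5py

# File-wide chunk cache: 8 MiB for every dataset opened from this file.
f = h5py.File("example.h5", "r", rdcc_nbytes=8 * 2**20)

# Per-dataset chunk cache: 16 MiB for one existing dataset, opened
# through require_dataset() as noted above.
dset = f.require_dataset("temperature", shape=(8192, 8192), dtype="f4",
                         rdcc_nbytes=16 * 2**20)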
 
Variable-Length Datatypes

The current implementation of variable-length (vlen) data in HDF5 files prevents easy retrieval using HTTP range GET requests.
Alternative access methods require duplicating vlen data outside its file or a custom HDF5 file format reader.
Minimize use of these datatypes in HDF5 datasets if not using the HDF5 library for cloud data access.
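
Where values have a known maximum size, fixed-length datatypes avoid the problem; a minimal h5py sketch (the file name, dataset name, and 16-byte string length are illustrative assumptions):

import h5py

with h5py.File("granule.h5", "w") as f:
    # A variable-length string dataset would be:
    #   f.create_dataset("ids", (100,), dtype=h5py.string_dtype())
    # Fixed-length strings are stored inside the dataset's own chunks,
    # so they stay reachable with plain HTTP range GET requests.
    f.create_dataset("ids", (100,), dtype=h5py.string_dtype(length=16))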
 
Combined Internal File Metadata

Only one data read is needed to learn about a file's content (what's in it, and where it is).
Important for all use cases that require a file's content description prior to any data reads.
The HDF5 library by default spreads internal file metadata in small blocks throughout the file.
 
How to Combine Internal File Metadata?

1. Create files with the Paged Aggregation file space management strategy.
2. Create files with an increased file metadata block size.
3. Store the file's content information in its User Block.
 
HDF5 Paged Aggregation

One of the available file space management strategies. Not the default.
Can only be set at file creation.
Best suited when file content is added once and never modified.
The library will always read and write in file pages.
 
HDF5 Paged Aggregation (cont’d)

Internal file metadata and raw data are organized in separate pages of specified size.
Setting an appropriate page size can place all internal file metadata in just one page.
Only page aggregated files can use the library's page buffer cache, which can significantly reduce subsequent data access.
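
When reading a page aggregated file, the page buffer is enabled at open time; a minimal h5py sketch (the file name and 8 MiB buffer size are illustrative assumptions; the page_buf_size keyword requires h5py 3.3 or later):

import h5py

# The page buffer caches whole file pages; it works only on files
# created with the "page" file space strategy.
f = h5py.File("paged.h5", "r", page_buf_size=8 * 2**20)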
 
How to Create Paged Aggregated Files

HDF5 library:
   H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_PAGE, …)
   H5Pset_file_space_page_size(fcpl, page_size)
   fcpl: file creation property list
h5py
   h5py.File class
netCDF library
   Not supported yet.
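
A minimal h5py sketch of the creation call (the file name and 8 MiB page size are illustrative assumptions):

import h5py

# fs_strategy and fs_page_size correspond to the two C calls above.
f = h5py.File("paged.h5", "w", fs_strategy="page", fs_page_size=8 * 2**20)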
 
 
How to Apply Paged Aggregation to Existing Files

$ h5repack -S PAGE -G PAGE_SIZE_BYTES in.h5 out.h5
 
File Metadata Block Size

Applies to non-page aggregated files.
Internal metadata is stored in metadata blocks in a file. The default block size is 2048 bytes.
Setting a bigger block size at file creation can combine internal metadata into a single contiguous block.
Recommended: make the block size equal to the size of the entire internal file metadata.
 
How to Create Files with Larger Metadata Block

HDF5 library:
   H5Pset_meta_block_size(fapl, block_size)
   fapl: file access property list
h5py
   h5py.File class
netCDF library
   Not supported yet.
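
A minimal h5py sketch (the file name and 8 MiB block size are illustrative assumptions; the meta_block_size keyword requires a recent h5py release):

import h5py

# Reserve internal file metadata space in one large contiguous block.
f = h5py.File("combined_meta.h5", "w", meta_block_size=8 * 2**20)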
 
 
How to Increase Metadata Block for Existing Files

$ h5repack --metadata_block_size SIZE_BYTES in.h5 out.h5
 
How to Find Internal File Metadata Size?

$ h5stat -S ATL03_20190928165055_00270510_004_01.h5
Filename: ATL03_20190928165055_00270510_004_01.h5
File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
File space page size: 4096 bytes
Summary of file space information:
  File metadata: 7713028 bytes
  Raw data: 2458294886 bytes
  Amount/Percent of tracked free space: 0 bytes/0.0%
  Unaccounted space: 91670 bytes
  Total space: 2466099584 bytes
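
The "File metadata" figure (about 7.4 MiB in this example) is the value to match when choosing a file space page size or a metadata block size for the repacked file.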
 
File Content Info in User Block

The user block is a block of bytes at the beginning of an HDF5 file that the library skips, so any user application content can be stored there.
Extracted file content information can be stored in the user block to be readily available later with a single data read request.
This new info stays with the file – still one cloud store object.
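
Because the user block starts at byte 0 of the file, its content can be fetched with one plain read, no HDF5 library required; a minimal sketch (the file name is an illustrative assumption):

import h5py

path = "granule.h5"
with h5py.File(path, "r") as f:
    ub_size = f.userblock_size  # user block length in bytes

# The user block is ordinary leading bytes, readable without HDF5.
with open(path, "rb") as f:
    content_info = f.read(ub_size)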
 
How to Create File with User Block

HDF5 library
   H5Pset_userblock()
h5py
   h5py.File class
netCDF library
   Not supported yet.
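
A minimal h5py sketch (the file name and 512 KiB size are illustrative assumptions; HDF5 requires the user block size to be a power of 2 and at least 512 bytes):

import h5py

# Reserve a 512 KiB user block at file creation.
f = h5py.File("with_ublock.h5", "w", userblock_size=512 * 1024)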
 
 
How to Add User Block to Existing Files

$ h5repack --block=SIZE_BYTES --ublock=user_block.file in.h5 out.h5
or
$ h5jam -i in.h5 -u user_block.file -o out.h5

IMPORTANT: If interested in dataset chunk file locations, the user block content must be extracted after the user block has been added. Current HDF5 functions for chunk file locations have a bug, so the user block must be added first.
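
For reference, dataset chunk file locations can be listed with h5py's low-level chunk query calls; a minimal sketch (the file and dataset names are illustrative assumptions):

import h5py

with h5py.File("out.h5", "r") as f:
    dsid = f["temperature"].id
    for i in range(dsid.get_num_chunks()):
        info = dsid.get_chunk_info(i)  # logical offset, file offset, size
        print(info.chunk_offset, info.byte_offset, info.size)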
 
Wrap-Up

The HDF5 library can create cloud-optimized HDF5 files if instructed.
Larger dataset chunk sizes are the most important data access optimization.
Combining internal metadata is required for discovering file content when files are already in an object store.
The two optimizations are not related.
 
Wrap-Up (cont’d)

Page aggregated files are recommended if the HDF5 library will also be used for cloud data access.
DMR++ is recommended for the file content description stored in the file's user block.
Cloud-optimize HDF5 files prior to transfer to an object store. Best is to create such files cloud-optimized from the start and avoid any post-optimization.
 
This work was supported by NASA/GSFC under
Raytheon Technologies contract number
80GSFC21CA001.
 
Thank you!