Cloud-Optimized HDF5 Files for Efficient Data Access

 
Cloud-Optimized HDF5 Files
 
2023 ESIP Summer Meeting
 
This work was supported by NASA/GSFC under Raytheon Technologies contract number 80GSFC21CA001.
This document does not contain technology or Technical Data controlled under either the U.S. International Traffic in Arms Regulations or the U.S. Export Administration Regulations.
 
Aleksandar Jelenak
NASA EED-3/HDF Group
ajelenak@hdfgroup.org
 
 
Cloud Optimized HDF5* File

An HDF5 file where internal structures are arranged for more efficient data access when in cloud object stores.

*Hierarchical Data Format Version 5
 
Cloud Optimized Means…

One file = one cloud store object.
Read-only data access.
Data reading based on HTTP* range GET requests with specific file offsets and byte counts.

*Hypertext Transfer Protocol
 
Why Cloud-Optimized HDF5 Files?

Least amount of reformatting from archive tapes to cloud object stores.
The HDF5 library instead of custom HDF5 file format readers with limited capabilities.
Fast content scanning when files are in an object store.
Efficient data access for both cloud-native and conventional applications.
 
 
What Makes a Cloud-Optimized HDF5 File?

Large dataset chunk size (2–16 MiB*).
Minimal use of variable-length datatypes.
Combined internal file metadata.

*mebibyte (1,048,576 bytes)
 
Large Dataset Chunk Sizes

Chunk size is the product of the dataset's datatype size in bytes and the chunk's shape (number of elements for each dimension).
An AWS* best practices document recommends that one HTTP range GET request cover 8–16 MiB.

*Amazon Web Services
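
For example, a minimal sketch of the chunk size arithmetic (the float32 datatype and the 512 x 1024 chunk shape are illustrative assumptions, not values from this presentation):

import numpy as np

# Hypothetical dataset: float32 elements, chunks of 512 x 1024 elements.
dtype = np.dtype("float32")         # 4 bytes per element
chunk_shape = (512, 1024)           # elements per chunk dimension

chunk_nbytes = dtype.itemsize * int(np.prod(chunk_shape))
print(chunk_nbytes / 2**20, "MiB")  # 4 * 512 * 1024 bytes = 2.0 MiB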
 
Large Dataset Chunk Sizes (cont.)

Chunk shape still needs to account for likely data access patterns.
The default HDF5 library dataset chunk cache is only 1 MiB but is configurable per dataset.
Chunk cache size can have a significant impact on I/O performance.
There is a trade-off between chunk size and compression/decompression speed.
 
How to Set Chunk Cache Size?

HDF5 library
   H5Pset_cache() for all datasets in a file
   H5Pset_chunk_cache() for individual datasets
h5py
   h5py.File class for all datasets in a file
   h5py.Group.require_dataset() method for individual datasets
netCDF* library
   nc_set_chunk_cache() for all variables in a file
   nc_set_var_chunk_cache() for individual variables

*Network Common Data Form
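
A minimal h5py sketch of both levels (the file name, dataset name, shape, and cache sizes are illustrative assumptions):

import h5py

# File-wide chunk cache: 8 MiB for every dataset opened from this file.
f = h5py.File("example.h5", "r", rdcc_nbytes=8 * 2**20)

# Per-dataset chunk cache: 16 MiB for one existing dataset, opened
# through require_dataset() as noted above.
dset = f.require_dataset("temperature", shape=(8192, 8192), dtype="f4",
                         rdcc_nbytes=16 * 2**20)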
 
Variable-Length Datatypes

The current implementation of variable-length (vlen) data in HDF5 files prevents easy retrieval using HTTP range GET requests.
Alternative access methods require duplicating vlen data outside its file or a custom HDF5 file format reader.
Minimize use of these datatypes in HDF5 datasets if not using the HDF5 library for cloud data access.
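
Where values have a known maximum size, fixed-length datatypes avoid the problem; a minimal h5py sketch (the file name, dataset name, and 16-byte string length are illustrative assumptions):

import h5py

with h5py.File("granule.h5", "w") as f:
    # A variable-length string dataset would be:
    #   f.create_dataset("ids", (100,), dtype=h5py.string_dtype())
    # Fixed-length strings are stored inside the dataset's own chunks,
    # so they stay reachable with plain HTTP range GET requests.
    f.create_dataset("ids", (100,), dtype=h5py.string_dtype(length=16))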
 
Combined Internal File Metadata

Only one data read is needed to learn about a file's content (what's in it, and where it is).
Important for all use cases that require a file's content description prior to any data reads.
The HDF5 library by default spreads internal file metadata in small blocks throughout the file.
 
How to Combine Internal File Metadata?

1. Create files with the Paged Aggregation file space management strategy.
2. Create files with an increased file metadata block size.
3. Store the file's content information in its User Block.
 
HDF5 Paged Aggregation

One of the available file space management strategies. Not the default.
Can only be set at file creation.
Best suited when file content is added once and never modified.
The library will always read and write in file pages.
 
HDF5 Paged Aggregation (cont’d)

Internal file metadata and raw data are organized in separate pages of specified size.
Setting an appropriate page size can place all internal file metadata in just one page.
Only page aggregated files can use the library's page buffer cache, which can significantly reduce subsequent data access.
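
When reading a page aggregated file, the page buffer is enabled at open time; a minimal h5py sketch (the file name and 8 MiB buffer size are illustrative assumptions; the page_buf_size keyword requires h5py 3.3 or later):

import h5py

# The page buffer caches whole file pages; it works only on files
# created with the "page" file space strategy.
f = h5py.File("paged.h5", "r", page_buf_size=8 * 2**20)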
 
How to Create Paged Aggregated Files

HDF5 library:
   H5Pset_file_space_strategy(fcpl, H5F_FSPACE_STRATEGY_PAGE, …)
   H5Pset_file_space_page_size(fcpl, page_size)
   fcpl: file creation property list
h5py
   h5py.File class
netCDF library
   Not supported yet.
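
A minimal h5py sketch of the creation call (the file name and 8 MiB page size are illustrative assumptions):

import h5py

# fs_strategy and fs_page_size correspond to the two C calls above.
f = h5py.File("paged.h5", "w", fs_strategy="page", fs_page_size=8 * 2**20)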
 
 
How to Apply Paged Aggregation to Existing Files

$ h5repack -S PAGE -G PAGE_SIZE_BYTES in.h5 out.h5
 
File Metadata Block Size

Applies to non-page aggregated files.
Internal metadata is stored in metadata blocks in a file. The default block size is 2048 bytes.
Setting a bigger block size at file creation can combine internal metadata into a single contiguous block.
Recommended: make the block size equal to the size of the entire internal file metadata.
 
How to Create Files with Larger Metadata Block

HDF5 library:
   H5Pset_meta_block_size(fapl, block_size)
   fapl: file access property list
h5py
   h5py.File class
netCDF library
   Not supported yet.
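
A minimal h5py sketch (the file name and 8 MiB block size are illustrative assumptions; the meta_block_size keyword requires a recent h5py release):

import h5py

# Reserve internal file metadata space in one large contiguous block.
f = h5py.File("combined_meta.h5", "w", meta_block_size=8 * 2**20)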
 
 
How to Increase Metadata Block for Existing Files

$ h5repack --metadata_block_size SIZE_BYTES in.h5 out.h5
 
How to Find Internal File Metadata Size?

$ h5stat -S ATL03_20190928165055_00270510_004_01.h5
Filename: ATL03_20190928165055_00270510_004_01.h5
File space management strategy: H5F_FSPACE_STRATEGY_FSM_AGGR
File space page size: 4096 bytes
Summary of file space information:
  File metadata: 7713028 bytes
  Raw data: 2458294886 bytes
  Amount/Percent of tracked free space: 0 bytes/0.0%
  Unaccounted space: 91670 bytes
  Total space: 2466099584 bytes
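
The "File metadata" figure (about 7.4 MiB in this example) is the value to match when choosing a file space page size or a metadata block size for the repacked file.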
 
File Content Info in User Block

The user block is a block of bytes at the beginning of an HDF5 file that the library skips, so any user application content can be stored there.
Extracted file content information can be stored in the user block to be readily available later with a single data read request.
This new info stays with the file – still one cloud store object.
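
Because the user block starts at byte 0 of the file, its content can be fetched with one plain read, no HDF5 library required; a minimal sketch (the file name is an illustrative assumption):

import h5py

path = "granule.h5"
with h5py.File(path, "r") as f:
    ub_size = f.userblock_size  # user block length in bytes

# The user block is ordinary leading bytes, readable without HDF5.
with open(path, "rb") as f:
    content_info = f.read(ub_size)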
 
How to Create File with User Block

HDF5 library
   H5Pset_userblock()
h5py
   h5py.File class
netCDF library
   Not supported yet.
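
A minimal h5py sketch (the file name and 512 KiB size are illustrative assumptions; HDF5 requires the user block size to be a power of 2 and at least 512 bytes):

import h5py

# Reserve a 512 KiB user block at file creation.
f = h5py.File("with_ublock.h5", "w", userblock_size=512 * 1024)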
 
 
How to Add User Block to Existing Files

$ h5repack --block=SIZE_BYTES --ublock=user_block.file in.h5 out.h5
or
$ h5jam -i in.h5 -u user_block.file -o out.h5

IMPORTANT: If interested in dataset chunk file locations, the user block content must be extracted after the user block has been added. Current HDF5 functions for chunk file locations have a bug, so the user block must be added first.
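
For reference, dataset chunk file locations can be listed with h5py's low-level chunk query calls; a minimal sketch (the file and dataset names are illustrative assumptions):

import h5py

with h5py.File("out.h5", "r") as f:
    dsid = f["temperature"].id
    for i in range(dsid.get_num_chunks()):
        info = dsid.get_chunk_info(i)  # logical offset, file offset, size
        print(info.chunk_offset, info.byte_offset, info.size)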
 
Wrap-Up

The HDF5 library can create cloud-optimized HDF5 files if instructed.
Larger dataset chunk sizes are the most important data access optimization.
Combining internal metadata is required for discovering file content when files are already in an object store.
The two optimizations are not related.
 
Wrap-Up (cont’d)

Page aggregated files are recommended if the HDF5 library will also be used for cloud data access.
DMR++ is recommended for the file content description stored in the file's user block.
Cloud-optimize HDF5 files prior to transfer to an object store. Best is to create such files cloud-optimized from the start and avoid any post-optimization.
 
This work was supported by NASA/GSFC under
Raytheon Technologies contract number
80GSFC21CA001.
 
Thank you!