LIOProf: Exposing Lustre File System Behavior for I/O Middleware
Cong Xu, Vishwanath Venkatesan, Omkar Kulkarni, Kalyana Chadalavada (Intel Corporation)
Suren Byna (Lawrence Berkeley National Laboratory)
Robert Sisneros (National Center for Supercomputing Applications)
Mohamad Chaarawi (The HDF5 Group)
Outline
Background and Introduction
Motivation
LIOProf Design and Implementation
Two Case Studies
Improve MPI-IO Performance over Lustre
Address Parallel HDF5 Overhead
Conclusion
I/O System and Profiling Tools
Parallel I/O subsystems are complex
Levels of software stacks, hardware layers, various I/O patterns
Detecting performance bottlenecks is challenging
Addressing the challenge: profiling tools
Facilitate I/O characterization for analysis of I/O activities
Existing profiling tools: Darshan, Lustre Monitoring Tool (LMT)
Parallel I/O stack: Application, HDF5/NetCDF, MPI-IO, Lustre file system, backend disks
Issues in Legacy Lustre Profiling Tools
Limited I/O tracing information
CPU utilization, memory usage, disk bandwidth, etc.
Missing correlation information between Lustre clients and servers
Need to uncover how application I/O requests correlate with file system activities
Lustre RPC traces provide this information
Lustre Monitoring Tool Snapshot
Lustre RPC Tracing
Analyze Lustre RPC trace logs
Clients' I/O requests on OSS nodes
I/O workload distribution
Lock contention
Diagram: Client0, Client1, and Client2 issue read requests (at times 0:06, 0:12, 0:12, and 0:25) to the MDS and to OSS0/OSS1, which serve MDT, OST0, and OST1; the RPC trace on each OSS records which client request was handled and when
LIOProf: Lustre I/O Profiler
Logging Services
Enable RPC tracing to record the I/O activities of OSS nodes
Available to super users / administrators
Statistics Collection and Visualization
Collect the statistical metrics and generate visualization plots
Logs can be parsed offline
LIOProf Components
LIOProf Logging Services
Enable Lustre RPC (Remote Procedure Call) tracing
Set the "debug" parameter to the rpctrace log level
Employ the debug buffer to store RPC tracing logs in memory
Launch a background debug_daemon to drain the logs
Overhead of LIOProf Logging Services
Compare benchmark performance with and without LIOProf enabled
The performance difference is less than 1%
LIOProf Statistics Collection and Visualization
Parse RPC traces for I/O activity information
Log item time: time at which the request was handled
RPC source: client that issued the request
RPC operation code (opc): request type
I/O statistics visualization
Gather and organize the parsed output
Create the gnuplot script for visualization
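
For illustration only, the short C sketch below mirrors the collection step described on this slide: it bins per-second read and write RPC counts and emits a matching gnuplot script. The input format (records already reduced to "timestamp client_nid opc" triples), the file names rpc_records.txt, rpc_hist.dat, and rpc_hist.gp, and the opcode values are assumptions made for this example; LIOProf's actual parser and the raw Lustre rpctrace syntax are not reproduced here.

/* Minimal sketch: aggregate pre-parsed RPC records and emit a gnuplot script.
 * Assumed input format (hypothetical, not the raw Lustre log syntax):
 *   <timestamp_sec> <client_nid> <opc>   e.g. "1617 10.0.0.3@o2ib 3"      */
#include <stdio.h>

#define MAX_SECONDS 4096

int main(void)
{
    FILE *in = fopen("rpc_records.txt", "r");   /* hypothetical file name */
    if (!in) { perror("rpc_records.txt"); return 1; }

    static long reads[MAX_SECONDS], writes[MAX_SECONDS];
    long t, base = -1, max_idx = 0;
    char nid[64];
    int opc;

    while (fscanf(in, "%ld %63s %d", &t, nid, &opc) == 3) {
        if (base < 0) base = t;            /* normalize to the first record */
        long idx = t - base;
        if (idx < 0 || idx >= MAX_SECONDS) continue;
        if (opc == 3)       reads[idx]++;  /* OST_READ  (assumed opcode)    */
        else if (opc == 4)  writes[idx]++; /* OST_WRITE (assumed opcode)    */
        if (idx > max_idx)  max_idx = idx;
    }
    fclose(in);

    /* Write a per-second histogram and a matching gnuplot script. */
    FILE *dat = fopen("rpc_hist.dat", "w");
    FILE *gp  = fopen("rpc_hist.gp",  "w");
    if (!dat || !gp) return 1;

    for (long s = 0; s <= max_idx; s++)
        fprintf(dat, "%ld %ld %ld\n", s, reads[s], writes[s]);

    fprintf(gp, "set xlabel 'Time (s)'\n"
                "set ylabel 'RPCs per second'\n"
                "plot 'rpc_hist.dat' using 1:2 with lines title 'read', \\\n"
                "     'rpc_hist.dat' using 1:3 with lines title 'write'\n");

    fclose(dat);
    fclose(gp);
    return 0;
}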
Case 1: Investigate MPI-IO Performance over Lustre
IOR benchmark with the MPI-IO API on the Wolf cluster
192 processes perform concurrent I/O in an interleaved access pattern on a shared file
IOR config [FileSize: 768GB, BlockSize: 4MB, TransferSize: 4MB, Aggregators: 4]
Lustre config [4 OSTs, 6 Clients, Stripe Size: 4MB, Stripe Count: 4]
Obdfilter-survey is employed to measure the maximum available bandwidth
The MVAPICH read operation performs 54.8% worse than obdfilter-survey
Overall Performance of MVAPICH
Use LIOProf to Analyze Server I/O Bandwidth
Each OST is able to deliver the maximum available write bandwidth
Read bandwidth is distributed across 4 aggregators on the clients
However, the aggregated read bandwidth is lower than the maximum bandwidth
MVAPICH cannot obtain Lustre stripe info, so each aggregator reads from multiple OSTs
Plots: server-side Write and Read I/O activity per OST
Lustre-Aware CB (Collective Buffer) Read Algorithm
Obtain Lustre stripe info so that each aggregator reads from one OST
The Lustre-aware read algorithm delivers near-optimal bandwidth
Lustre-aware performs 104% better than the original MVAPICH
Each OST serves I/O requests at a high rate
Overall Performance of the Lustre-Aware CB Algorithm
Lustre-Aware Read
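
The Lustre-aware CB read algorithm itself lives inside the MPI-IO (ROMIO) layer of MVAPICH; the C sketch below is not that implementation. It only illustrates, under the configuration used in this talk, the standard ROMIO hint mechanism an application can use to align collective-buffering aggregators with Lustre striping. The hint names (striping_unit, striping_factor, cb_nodes, romio_cb_read) are standard ROMIO hints; the file path is a placeholder.

/* Minimal sketch: align MPI-IO collective buffering with Lustre striping
 * via standard ROMIO hints. Not the authors' Lustre-aware CB algorithm,
 * which is internal to the MPI library.                                  */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* Match the Lustre layout from the talk: 4 MB stripes, 4 OSTs. */
    MPI_Info_set(info, "striping_unit",   "4194304");
    MPI_Info_set(info, "striping_factor", "4");
    /* One aggregator per OST, collective buffering enabled for reads. */
    MPI_Info_set(info, "cb_nodes",        "4");
    MPI_Info_set(info, "romio_cb_read",   "enable");

    MPI_File fh;
    /* "/lustre/ior_testfile" is a placeholder path for the example. */
    MPI_File_open(MPI_COMM_WORLD, "/lustre/ior_testfile",
                  MPI_MODE_RDONLY, info, &fh);

    /* ... collective reads, e.g. MPI_File_read_at_all(), go here ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}

Note that the striping hints only take effect when a file is created; for an existing file, the Lustre layout comes from the file system itself.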
Lustre-Aware CB Read Performance on Cori System
Use IOR to launch 128 to 4096 processes on 128 Lustre clients
Read 16TB of data from 96 Lustre OSTs
Lustre-aware performs 134% better than MVAPICH at 4096 processes
Read Performance on Cori System
Case 2: Measure Parallel HDF5 Overhead
After optimization, MPI-IO achieves maximum bandwidth
IOR benchmark with the HDF5 and MPI-IO APIs
4 processes perform I/O in an interleaved access pattern on a shared file
IOR config [FileSize: 512GB, BlockSize: 4MB, TransferSize: 4MB, Aggregators: 4]
Lustre config [4 OSTs, 4 Clients, Stripe Size: 4MB, Stripe Count: 4]
HDF5 performs worse than MPI-IO, especially in the read operation
Comparison between MPIIO and HDF5
Using LIOProf to Reveal HDF5 I/O Activities in Read
OSTs constantly deliver high bandwidth in the MPI-IO case
In the HDF5 case, each OST services both dataset and metadata I/O requests
HDF5 metadata operations affect dataset I/O accesses
Plots: server-side read activity in the MPI-IO and HDF5 cases
HDF5 Collective Metadata and Datasets Optimizations
Enable HDF5 collective metadata write/read in IOR
Additionally, open all the datasets at the beginning and cache dataset metadata in memory
HDF5-Coll_Meta&DataSet_Opt outperforms HDF5 and HDF5-Coll_Meta by 175.3% and 65.1%, respectively, in read
The overhead of metadata operations has been reduced significantly
Overall Performance with Both Collective Metadata and Dataset Optimizations
HDF5-Coll_Meta&DataSet_Opt
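
The dataset optimization above was applied inside the benchmark; the C sketch below shows the same idea in a minimal, stand-alone form, assuming hypothetical names ior_test.h5 and dset0..dset63 rather than IOR's actual naming: every dataset is opened once before the timed I/O phase so that later reads reuse the cached handles instead of re-issuing metadata operations.

/* Minimal sketch of the dataset optimization: open all datasets up front
 * so the timed read phase reuses cached handles. File and dataset names
 * are placeholders for the example.                                      */
#include <hdf5.h>
#include <stdio.h>

#define NDSETS 64

int main(void)
{
    hid_t file = H5Fopen("ior_test.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0) return 1;

    hid_t dsets[NDSETS];
    char  name[32];

    /* Phase 1: open every dataset before the I/O phase begins. */
    for (int i = 0; i < NDSETS; i++) {
        snprintf(name, sizeof(name), "dset%d", i);
        dsets[i] = H5Dopen2(file, name, H5P_DEFAULT);
    }

    /* Phase 2: timed reads simply reuse the cached handles, e.g.
     * H5Dread(dsets[i], H5T_NATIVE_DOUBLE, memspace, filespace,
     *         H5P_DEFAULT, buf);                                         */

    for (int i = 0; i < NDSETS; i++)
        if (dsets[i] >= 0) H5Dclose(dsets[i]);
    H5Fclose(file);
    return 0;
}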
Conclusion
Propose the Lustre I/O Profiler, LIOProf, to track I/O activities on Lustre servers
LIOProf is useful in uncovering correlations between application I/O patterns and Lustre behavior
Leverage LIOProf in two case studies
Case 1: Design and implement Lustre-aware Collective Buffer Read Algorithm
Case 2: Identify HDF5 overhead and improve its performance
Thank You and Questions
Two Case Studies with LIOProf
Case 1: Investigate MPI-IO performance over Lustre
Identify the issue in the MVAPICH read algorithm over Lustre
Implement the Lustre-Aware CB (Collective Buffer) Read Algorithm
Case 2: Measure Parallel HDF5 overhead
Observe a considerable performance gap between the HDF5 and MPI-IO cases
Address the HDF5 overhead by enabling collective metadata and applying dataset optimizations
Evaluation Environment
Wolf Cluster
Each node has 64GB memory and 36 CPU cores
Nodes are connected by Mellanox ConnectX QDR IB
Cori Supercomputer
1,630 compute nodes with 30PB of Lustre storage
32 CPU cores and 128GB memory per node
Cray Aries high-speed interconnect with Dragonfly topology
Software Configuration
IOR benchmark with MPI-IO and HDF5 APIs
MVAPICH2-2.2b, Lustre version 2.7
HDF5-1.9.234 (parallel)
Optimize HDF5 with Collective Metadata Operations
Enable HDF5 collective metadata write/read in IOR
H5Pset_coll_metadata_write() & H5Pset_all_coll_metadata_ops()
Collective metadata read: one process reads the metadata and broadcasts it
HDF5-Coll_Meta delivers 175.3% higher read bandwidth than HDF5
Overall Performance with Collective Metadata Optimization
I/O Activities in Read Operation
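
As a minimal sketch of how the two property-list calls named on this slide are typically used, the C fragment below enables collective metadata writes and collective metadata operations on a parallel file access property list. The MPI setup and the file name ior_test.h5 are assumptions made for the example, not the IOR configuration from the talk.

/* Minimal sketch: enable HDF5 collective metadata operations on a
 * parallel file access property list, using the calls named above.
 * The file name "ior_test.h5" is a placeholder.                          */
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* File access property list with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* Collective metadata writes and collective metadata reads/ops. */
    H5Pset_coll_metadata_write(fapl, 1);
    H5Pset_all_coll_metadata_ops(fapl, 1);

    hid_t file = H5Fopen("ior_test.h5", H5F_ACC_RDONLY, fapl);

    /* ... dataset opens and reads now perform metadata ops collectively:
     * one rank reads the metadata and broadcasts it to the others ...    */

    if (file >= 0) H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}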